This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Vectorize loads of consecutive memory accesses, accessed in non-consecutive (jumbled) way.
ClosedPublic

Authored by • ashahid on Nov 21 2016, 1:27 AM.

Details

Reviewers

mkuper
mssimpso
hfinkel

Commits

rG3121334d3218: [SLP] Vectorize loads of consecutive memory accesses, accessed in non…
rL293386: [SLP] Vectorize loads of consecutive memory accesses, accessed in non…

Summary

This patch extends the SLPVectorizer pass to vectorize loads that access consecutive memory in a jumbled (non-consecutive) order, using "load + shufflevector" IR instructions. The jumbled scalar loads are sorted while building the tree, and these accesses are marked so that a "shufflevector" with the proper mask is generated after the vectorized load.
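
To make the summary concrete, here is a minimal standalone sketch of the idea in plain C++, with no LLVM types. The helper names `jumbledLoadMask` and `applyShuffle` are hypothetical, and offsets are modelled as plain element indices; the patch itself works on IR loads and emits a real `shufflevector`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not part of the patch): given the element offset that
// each scalar load reads, listed in program order, return the shuffle mask
// that restores program order from one wide load of the sorted addresses.
// Returns an empty mask if the offsets do not cover a consecutive range.
std::vector<int> jumbledLoadMask(const std::vector<int> &Offsets) {
  if (Offsets.empty())
    return {};
  int Base = *std::min_element(Offsets.begin(), Offsets.end());
  std::vector<bool> Seen(Offsets.size(), false);
  std::vector<int> Mask;
  for (int Off : Offsets) {
    int Lane = Off - Base; // lane of this scalar inside the wide load
    if (Lane >= static_cast<int>(Offsets.size()) || Seen[Lane])
      return {}; // gap or duplicate: the memory is not consecutive
    Seen[Lane] = true;
    Mask.push_back(Lane);
  }
  return Mask;
}

// Models "shufflevector Vec, undef, Mask" on a plain vector.
std::vector<int> applyShuffle(const std::vector<int> &Vec,
                              const std::vector<int> &Mask) {
  std::vector<int> Out;
  for (int Lane : Mask) {
    assert(Lane >= 0 && Lane < static_cast<int>(Vec.size()));
    Out.push_back(Vec[Lane]);
  }
  return Out;
}
```

For scalar loads that read elements 1, 3, 2, 0 of a four-element array, the mask is `<1, 3, 2, 0>`: one wide load of elements 0..3 followed by that shuffle yields the values in the order the scalar code used them.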

Event Timeline

• ashahid updated this revision to Diff 78690. Nov 21 2016, 1:27 AM

• ashahid retitled this revision to [SLP] Vectorize loads of consecutive memory accesses, accessed in non-consecutive (jumbled) way.

• ashahid updated this object.

• ashahid added reviewers: mkuper, hfinkel, mssimpso.

• ashahid added a subscriber: llvm-commits.

Herald added subscribers: mzolotukhin, sanjoy. Nov 21 2016, 1:27 AM

RKSimon added a subscriber: RKSimon. Nov 21 2016, 2:06 AM

Some minor comments - mainly code style etc.

lib/Analysis/LoopAccessAnalysis.cpp
1064	clang-format this
1066	This can probably be replaced with a for range loop: for (auto *Val : VL) { and then replace the uses of VL[i] with Val
1075	use auto* for dyn_cast return.
1081	newVL is unused?
1083	Use auto? You can drop the braces as well.
lib/Transforms/Vectorize/SLPVectorizer.cpp
468	This might be simplified with a for range loop and use of llvm::none_of / llvm::any_of?
1226	Is it worth breaking here once we know that shuffledLoad is false? Remove the braces if you can.
2580	for range loop?
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
3	Possibly commit this test to trunk with the current output generated by utils/update_test_checks.py?
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
4	It'd be better if there was more context to this additional shuffle - regenerate + commit the current output with utils/update_test_checks.py ? For an IR loop it's not that large an output.

This basically fixes PR28474, right? Does it work correctly on the test-cases there?

lib/Analysis/LoopAccessAnalysis.cpp
1065	Why is this a multimap?
1067	Could you add some documentation to explain what exactly this does?
lib/Transforms/Vectorize/SLPVectorizer.cpp
466	I think you may want a different name - this doesn't actually check whether the scalars are jumbled, it checks whether they're all present. That is, it'll return true even if they're all in-order.
468	Would it make sense to pre-sort both arrays, and then check the two sorted arrays are equal? This would make it O(n log n) instead of O(n^2). (I'm not sure what to sort on, though, or what the actual gain would be, since I guess VL.size() is small in practice.)
1196–1197	This TODO gets done.
1215	Do we still need the ReverseConsecutive case at all? That was introduced in r276477 as a patch for the common case of PR28474, where we find the loads in reverse order. But this should completely supersede it, right?
1215	Why not for VL.size() == 2?
1217	This looks rather weird. Can you make it more idiomatic?
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
10	Please add a test that has several load packets (e.g. multiplies one load sequence by another load sequence).
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
4	Yes, I'd be interested to know if we added a shuffle here, or just moved a shuffle from the store side to the load side (which makes sense).
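
The pre-sort question raised for line 468 can be sketched outside of LLVM as follows. This is only an illustration: `int` stands in for the `Value *` entries of VL, and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Quadratic check: every element of A must appear somewhere in B.
// As noted in the review, this is also true when A is already in order.
bool samePacketQuadratic(const std::vector<int> &A,
                         const std::vector<int> &B) {
  if (A.size() != B.size())
    return false;
  return std::all_of(A.begin(), A.end(), [&](int V) {
    return std::find(B.begin(), B.end(), V) != B.end();
  });
}

// Pre-sorted check: sort copies of both packets, then compare once.
// Two O(n log n) sorts plus an O(n) comparison.
bool samePacketPresorted(std::vector<int> A, std::vector<int> B) {
  if (A.size() != B.size())
    return false;
  std::sort(A.begin(), A.end());
  std::sort(B.begin(), B.end());
  return A == B;
}
```

The two versions agree when the packets contain no repeated elements. With repeats they can differ (the quadratic form accepts `{1, 1, 2}` against `{1, 2, 3}`, the pre-sorted form does not), so which one is right depends on whether duplicates can reach this check.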

Hi Simon, Michael

Thanks for the comments. Please find my responses inline.

Thanks,
Shahid

lib/Analysis/LoopAccessAnalysis.cpp
1065	This is because the elements in the multimap follow a certain order, so using this will ensure that the values are sorted accordingly.
lib/Transforms/Vectorize/SLPVectorizer.cpp
466	What about isFoundJumbled()?
468	Pre-sorting would require two sort calls and then a compare. IMO, for the small VL.size() here it would not make much difference. However, I am open to other views.
1196–1197	Yes, that's right.
1215	A jumbled VL with VL.size() == 2 is essentially a reversed VL. Considering the tradeoff between the compile time of an extra buildTree() for VL.size() == 2 and the additional runtime of a shufflevector, I opted for extra compile time over extra runtime.
1217	Sure
1226	Seems yes.
2580	Sure
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
3	By "current output" do you mean the output generated by utils/update_test_checks.py with this patch?
10	Sure
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
4	By "current output" do you mean the output generated by utils/update_test_checks.py with this patch? Please explain.

RKSimon added inline comments. Nov 22 2016, 1:31 PM

test/Transforms/SLPVectorizer/X86/jumbled-load.ll
3	I mean commit the current (pre-patch) codegen so that this patch demonstrates the diff.

Addresses the review comments and also updates the test case with more context.

Sorry for the delay, I was on vacation.

lib/Analysis/LoopAccessAnalysis.cpp
1076	Why are you turning a constant into a SCEV and back into a constant?
1077	If you know this is a SCEVConstant, this should be a cast<>. Otherwise, you need to check the dyn_cast<> actually succeeded.
lib/Transforms/Vectorize/SLPVectorizer.cpp
413–416	Please add an explanation of what the VL parameter means here.
468	To be honest, I'm not sure - so I'd appreciate another opinion about pre-sorting. Matt/Hal/Simon?
1215	The trade off here is more one of code complexity - is the gain in compile time worth having all the additional logic present for both the "fully unsorted" case and the "reversed" case.

mkuper added inline comments. Nov 30 2016, 1:02 PM

lib/Analysis/LoopAccessAnalysis.cpp
1062	The LLVM coding standard is that function names start with a non-capital, and variable names start with a capital. (There are some exceptions for functions, but this is mostly in old code.)
1065	It doesn't seem right to use a multimap just for the sorting behavior. I think you can find a more appropriate container. See http://llvm.org/docs/ProgrammersManual.html#picking-the-right-data-structure-for-a-task
1065	Also, capitalization.

RKSimon added inline comments. Nov 30 2016, 1:13 PM

lib/Analysis/LoopAccessAnalysis.cpp
1072	You are casting to PointerType and then only using it as a Type.
lib/Transforms/Vectorize/SLPVectorizer.cpp
1263–1264	for (unsigned i = 0, e = VL.size(); i < e; ++i) {
1358–1359	for (unsigned j = 0, e = VL.size(); j < e; ++j) {
1370–1371	for (unsigned j = 0, e = VL.size(); j < e; ++j) {
1381–1382	for (unsigned j = 0, e = VL.size(); j < e; ++j) {
2576	for (unsigned i = 0, e = VecTy->getNumElements(); i < e; ++i) {
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20	What can be done to avoid this regression?

mkuper added inline comments. Nov 30 2016, 1:21 PM

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20	Ohh, right, wanted to ask about this as well. My guess is that this wasn't actually a regression, but we moved the shuffle from store side to the load side. Is that right?

RKSimon added inline comments. Dec 1 2016, 1:42 AM

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20	If the update_test_checks script has done its job and generated checks for all the IR then this is an additional shuffle, I can't see an equivalent shuffle or set of extracts in the codegen on the left.

mssimpso added inline comments. Dec 1 2016, 10:33 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
468	I think we currently limit VL.size() to a maximum of 16? If so, the gain may not be that much, but I wouldn't expect presorting to be any worse.
1215	Am I wrong in thinking we don't necessarily know if rebuilding the tree with reversed loads would be any better than having the shuffle? Previously we were going to bail, but now we have an option.
2574	I probably missed this, but why are we checking the sizes? Does this mean there will be cases where E->NeedToShuffle is true but we don't generate the shuffle?

mkuper added inline comments. Dec 1 2016, 10:56 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1215	I don't think you're wrong, on the contrary - I was advocating removing the code I added (for reversing loads) and completely replacing it with something like this. But I didn't realize that'll introduce an extra shuffle.
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
51	What happens if the stores are also out of order? (IIRC, we should already have code to deal with that, I just want to make sure it meshes with the stores being out of order correctly)
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20	Argh, I didn't even look at the new version of the test, my assumption was from looking at the non-generated one (which is even more embarrassing, since I originally wrote that test, and didn't remember it doesn't have a shuffle...) We really should not be regressing this.

• ashahid added inline comments. Dec 2 2016, 2:58 AM

lib/Analysis/LoopAccessAnalysis.cpp
1062	ok
1065	I did refer to this manual but I could not find something similar. I am curious, what issue do you see with the usage of multimap? BTW, if you have any specific container in mind, please let me know.
1072	This is to resolve the compile-time error "'class llvm::Type' has no member named 'getElementType'".
lib/Transforms/Vectorize/SLPVectorizer.cpp
413–416	Sure
1263–1264	Do you want me to change the style of FOR statement to the above one?
2574	No, I want to ensure that the resulting vector type does not differ because of the length of the vector value.
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
51	I have not checked yet, but I will check.

RKSimon added inline comments. Dec 2 2016, 5:34 AM

lib/Analysis/LoopAccessAnalysis.cpp
1072	Sorry my mistake!
lib/Transforms/Vectorize/SLPVectorizer.cpp
1263–1264	Yes please - since you're touching this code, it might as well be dealt with. It can be done as a NFC pre-commit if you prefer to keep this patch cleaner.

mkuper added inline comments. Dec 2 2016, 10:44 AM

lib/Analysis/LoopAccessAnalysis.cpp
1065	Well, first, I don't believe you actually need a multimap here, right? We don't actually expect to get several elements with the same offset, we can fail immediately if that happens. So, you could replace the multimap with a regular map, and a check for the "multi" condition. Assuming I'm not missing anything about that, the options are basically either a regular std::map, or a sorted vector ( http://llvm.org/docs/ProgrammersManual.html#dss-sortedvectormap )
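
A rough sketch of the sorted-vector option with an explicit duplicate check, in plain C++. This is not the code from the patch: `OffsetAndId` and `sortByOffset` are hypothetical names, and an integer id stands in for the `Value *` of each access.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for one memory access: (byte offset, access id).
using OffsetAndId = std::pair<int64_t, int>;

// Sorts the accesses by offset and writes their ids to Sorted.
// Returns false and leaves Sorted empty if two accesses share an offset,
// which is the "multi" condition the review suggests failing on.
bool sortByOffset(std::vector<OffsetAndId> Accesses,
                  std::vector<int> &Sorted) {
  Sorted.clear();
  std::sort(Accesses.begin(), Accesses.end());
  auto Dup = std::adjacent_find(
      Accesses.begin(), Accesses.end(),
      [](const OffsetAndId &L, const OffsetAndId &R) {
        return L.first == R.first;
      });
  if (Dup != Accesses.end())
    return false;
  for (const OffsetAndId &A : Accesses)
    Sorted.push_back(A.second);
  return true;
}
```

Sorting once and then scanning for equal neighbours keeps the ordering behaviour that motivated the multimap while making the duplicate case an explicit failure.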

Updated the patch to address the review comments.

RKSimon added inline comments. Dec 7 2016, 4:26 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2576	clang-format?

Updated the comment for formatting and a test to incorporate this patch.

ping!

lib/Analysis/LoopAccessAnalysis.cpp
1065	Agreed. Updating the patch accordingly.
1076	My bad, refactored accordingly.
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
51	It gels well with 'stores' being out-of-order by generating proper shufflemask for loads according to the out-of-order stores.

RKSimon added inline comments. Dec 15 2016, 8:26 AM

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20	Any luck with working out what is causing this regression? Cross lane shuffles can be quite expensive.

mssimpso added inline comments. Dec 15 2016, 8:55 AM

include/llvm/Analysis/LoopAccessAnalysis.h
695	Should this be renamed to sortMemAccesses? If so, the comment above should also be updated: "jumbled memory accesses". Also, should we be returning a SmallVector here? We could also pass a SmallVectorImpl<Value *> &Sorted to the function and place the sorted values there.
lib/Analysis/LoopAccessAnalysis.cpp
1068	Please update comment since you're no longer using a multimap.
lib/Transforms/Vectorize/SLPVectorizer.cpp
413–416	Can we be a bit more explicit about VL. Are VL the scalar roots of the vectorizable tree?
2574	I don't think I fully understand this yet. Can you please make the comment more detailed. In particular, when does VL.size() not equal Scalars.size()? Is this the case when a bundle gets split up into smaller chunks? And then if this is true, what does it imply for the jumbled accesses. It looks like we will end up with a vector load still, but then when are they placed in the right order? Sorry if this should all be obvious!

• ashahid added inline comments. Dec 21 2016, 3:06 AM

include/llvm/Analysis/LoopAccessAnalysis.h
695	Ok, I will do that.
lib/Analysis/LoopAccessAnalysis.cpp
1068	Oh, sure.
lib/Transforms/Vectorize/SLPVectorizer.cpp
413–416	No, it is not the scalar roots of the vectorizable tree. VL is all isomorphic scalars, for example ADD1, ADD2 and so on, or LOAD1, LOAD2 etc.
2574	As such I don't expect VL.size() to differ from Scalars.size(), but if it does, the compiler may hit an assertion for incorrect vector types. I just wanted to avoid that. Maybe I am presuming too much. I will check by removing this specific check.
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20	The regression is because the order of the scalar loads here is initially reverse consecutive. I will update the patch to resolve it.

mssimpso added inline comments. Dec 21 2016, 7:20 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2574	Should we make the size check an assertion?

Updated the patch for the recent review comments which resolves the regressions in the given tests.

mssimpso added inline comments. Jan 3 2017, 8:32 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2575–2576	Hi Shahid, I'm hitting the assertion here while testing this patch. Can you take a look?

mssimpso added inline comments. Jan 3 2017, 8:59 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2596	I also saw verifier failures where TBAA metadata had been applied to the shuffle, like:
	TBAA is only for loads, stores and calls!
	%14 = shufflevector <4 x i32> %13, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>, !tbaa !66

• ashahid added inline comments. Jan 3 2017, 7:11 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2575–2576	Sure. If possible can you share the asserting test?
2596	Ok, will fix it.

RKSimon added inline comments. Jan 4 2017, 8:14 AM

test/Transforms/SLPVectorizer/X86/horizontal-list.ll
12 ↗	(On Diff #82406)	The changes in this file are from the regeneration script and are just polluting this patch. I've committed this against trunk at rL290969 - please rebase.
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
35	This looks suspicious - why the lonely change from TMP3 to TMP4?

mssimpso added inline comments. Jan 4 2017, 8:23 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2575–2576	Sure, I'll try and reduce something for you.
2596	I think you should probably just copy the metadata from the scalar load to the vector load, like:
	propagateMetadata(LI, E->Scalars);
	return Shuf;

• ashahid added inline comments. Jan 4 2017, 8:51 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2596	Yes, TBAA metadata is not for shufflevector, and the verifier recently added this assert.
test/Transforms/SLPVectorizer/X86/horizontal-list.ll
12 ↗	(On Diff #82406)	Ok
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
35	Oh good catch, I will see.

mssimpso added inline comments. Jan 4 2017, 9:08 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2575–2576	OK, you should be able to reproduce the assert with the bugpoint reduced test case at P7950. Thanks!
	opt < D26905.ll -slp-vectorizer -S

• ashahid added inline comments. Jan 6 2017, 1:45 AM

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
35	I was surprised initially but later realized that this is because the current patch resolves the regression you pointed out. So if you compare this patch, i.e. Diff5, with the previous patch, i.e. Diff4, you will see the expected difference.

Updated the patch to fix the assertion observed by Simon (thanks for the reduced test) and other comments.

No other comments from me. Thanks.

mkuper added inline comments. Jan 6 2017, 2:46 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
468	I'm still not sure we want this to be quadratic. I'd suggest one of two things:
	1. Change this to presort. For VL.size() == 4, it may be slower, but for VL.size() == 16, I'd expect it to be faster.
	2. If there's evidence that presorting is actually bad for small sizes, add a FIXME and bail out for VL.size() > 16.
	I'd prefer for us to fail to vectorize at larger VLs, than silently introduce a quadratic algorithm for larger Ns.

Updated the patch accordingly to address the comment

Thanks, Shahid.
The rest of my comments are cosmetic - except the one about the sort. I think your sort accidentally ended up quadratic.

lib/Analysis/LoopAccessAnalysis.cpp
1077	Shouldn't this sort call be outside the for loop?
lib/Transforms/Vectorize/SLPVectorizer.cpp
471	Can you just use std::equal directly? The only thing isSame() does, aside from that, is assert on the sizes - and you have that assert just a few lines above.
1228	newVL -> NewVL
1229	Could you please add braces to this for? It's a one-statement body, but not a one-line, so I think braces would be better.
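
The point about line 1077 can be shown with a small standalone comparison. This is a sketch rather than the patch code: plain `int` offsets replace the offset/value pairs, and `SortResult` and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct SortResult {
  std::vector<int> Sorted;
  unsigned SortCalls; // how many times std::sort ran
};

// Shape flagged in the review: the vector is re-sorted after every push_back,
// so N offsets cost N sort calls.
SortResult sortInsideLoop(const std::vector<int> &Offsets) {
  SortResult R{{}, 0};
  for (int Off : Offsets) {
    R.Sorted.push_back(Off);
    std::sort(R.Sorted.begin(), R.Sorted.end());
    ++R.SortCalls;
  }
  return R;
}

// Suggested shape: collect everything first, then sort exactly once.
SortResult sortAfterLoop(const std::vector<int> &Offsets) {
  SortResult R{Offsets, 0};
  std::sort(R.Sorted.begin(), R.Sorted.end());
  R.SortCalls = 1;
  return R;
}
```

Both versions produce the same order, which is why the misplaced call is easy to miss; only the amount of work differs.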

mkuper added subscribers: wmi, dberlin. Jan 12 2017, 7:49 PM

mkuper added inline comments.

lib/Analysis/LoopAccessAnalysis.cpp
1075	Thanks to @dberlin and @wmi - I now realize this is too simplistic. It will handle cases where the offsets are constant, but not when all the offsets are variable, but the variables have constant differences from each other. Anyway that can be handled separately. Could you add a FIXME here, please?
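
A toy model of the limitation described for line 1075, assuming a hypothetical `SymAddr` that represents an address as base + variable + constant. It does not use LLVM; inside LLVM this kind of reasoning would normally be done on SCEV expressions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical symbolic address: Base + Var + Const (Var is empty if absent).
struct SymAddr {
  std::string Base;
  std::string Var;
  int64_t Const;
};

// Constant-offset-only view: succeeds only when there is no variable term,
// which mirrors the "too simplistic" behaviour noted in the review.
bool constantOffset(const SymAddr &A, int64_t &Off) {
  if (!A.Var.empty())
    return false;
  Off = A.Const;
  return true;
}

// Difference view: two addresses with the same base and the same variable
// term still have a constant distance, even though neither offset is constant.
bool constantDifference(const SymAddr &A, const SymAddr &B, int64_t &Diff) {
  if (A.Base != B.Base || A.Var != B.Var)
    return false;
  Diff = B.Const - A.Const;
  return true;
}
```

For accesses such as `a[i]` and `a[i + 2]`, `constantOffset` fails on both, while `constantDifference` reports a distance of 2, which is the information a sort would need.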

Updated the patch accordingly.

A few more cosmetic comments (sorry I didn't ask you to fix them all at once, but I keep noticing new ones every time I read the code).
Also, there's another thing I just realized is missing from this patch - we don't consider NeedToShuffle in getEntryCost().
You basically need to add the cost of a TTI::SK_PermuteSingleSrc shuffle to every NeedToShuffle load.

(Actually, it's a bit more complicated than that, since some of those shuffles may end up getting removed later, but it's probably better to be conservative here.)
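
The cost adjustment being requested can be sketched as below. The numbers and the `LoadCosts` struct are hypothetical stand-ins for what the target cost model would return; only the shape of the computation (vector cost minus scalar cost, plus one permute per shuffled load bundle) follows the comment above.

```cpp
#include <cassert>

// Hypothetical per-instruction costs, standing in for target cost queries.
struct LoadCosts {
  int ScalarLoad;     // cost of one scalar load
  int VectorLoad;     // cost of one wide load
  int PermuteShuffle; // cost of one single-source permute shuffle
};

// Returns vector cost minus scalar cost for a bundle of NumLoads loads.
// A negative result means vectorizing the bundle looks profitable.
int loadEntryCost(const LoadCosts &C, int NumLoads, bool NeedToShuffle) {
  int ScalarCost = NumLoads * C.ScalarLoad;
  int VecCost = C.VectorLoad;
  if (NeedToShuffle)
    VecCost += C.PermuteShuffle; // conservative: assume the shuffle survives
  return VecCost - ScalarCost;
}
```

With these made-up costs, a four-wide jumbled load stays profitable after the shuffle is charged, while a two-wide one does not.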

lib/Analysis/LoopAccessAnalysis.cpp
1065	Also, this isn't a pair, it's a list of pairs. OffValPairs can work.
lib/Transforms/Vectorize/SLPVectorizer.cpp
2594	shuf -> Shuf
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
51	I thought you added a test for the combination of out-of-order loads and out-of-order stores, but turns out I was imagining it. Could you please add one? (We should have a regression test making sure we don't generate extra shuffles.)

Sorry for the delayed response. Updated the patch to include costing for extra shuffle and minor formatting.

LGTM, Thanks!

test/Transforms/SLPVectorizer/X86/store-jumbled.ll
14 ↗	(On Diff #86031)	Ok, so this is pretty much what I thought would happen. We shuffle both loads the same way, and then multiply, instead of multiplying and then shuffling. But this is probably fine - I hope InstCombine will pick up on this and combine it to a mul followed by a shuffle, if the masks match.
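
The equivalence this comment relies on is easy to check on plain arrays. The sketch below assumes four-lane integer vectors; `Vec4`, `shuffle` and `mul` are hypothetical helpers, not LLVM APIs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<int, 4>;

// Single-source shuffle: output lane I takes input lane Mask[I].
Vec4 shuffle(const Vec4 &V, const Vec4 &Mask) {
  Vec4 Out{};
  for (std::size_t I = 0; I < Out.size(); ++I) {
    assert(Mask[I] >= 0 && Mask[I] < 4);
    Out[I] = V[static_cast<std::size_t>(Mask[I])];
  }
  return Out;
}

// Lane-wise multiply.
Vec4 mul(const Vec4 &A, const Vec4 &B) {
  Vec4 Out{};
  for (std::size_t I = 0; I < Out.size(); ++I)
    Out[I] = A[I] * B[I];
  return Out;
}
```

When both operands are shuffled with the same mask, `mul(shuffle(A, M), shuffle(B, M))` equals `shuffle(mul(A, B), M)`, so two shuffles can in principle be replaced by one.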

This revision is now accepted and ready to land. Jan 27 2017, 10:53 AM

Closed by commit rL293386: [SLP] Vectorize loads of consecutive memory accesses, accessed in non… (authored by ashahid). Jan 28 2017, 10:10 AM

This revision was automatically updated to reflect the committed changes.

• ashahid added inline comments. Jan 31 2017, 1:35 AM

test/Transforms/SLPVectorizer/X86/store-jumbled.ll
14 ↗	(On Diff #86031)	Yes, you are right. I verified; it's happening exactly as you explained.

• ashahid mentioned this in D36130: [SLP] Vectorize jumbled memory loads. Aug 1 2017, 12:31 AM

• ashahid mentioned this in rL313736: [SLP] Vectorize jumbled memory loads. Sep 20 2017, 1:20 AM

• ashahid mentioned this in rL313771: [SLP] Vectorize jumbled memory loads. Sep 20 2017, 10:21 AM

hans mentioned this in rL313781: Revert r313771 "[SLP] Vectorize jumbled memory loads.". Sep 20 2017, 11:02 AM

• ashahid mentioned this in rL314806: [SLP] Vectorize jumbled memory loads. Oct 3 2017, 8:30 AM

hans mentioned this in rL314824: Revert r314806 "[SLP] Vectorize jumbled memory loads.". Oct 3 2017, 11:34 AM

• ashahid mentioned this in rL320548: [SLP] Vectorize jumbled memory loads. Dec 12 2017, 7:09 PM

Revision Contents

Path	Size
include/llvm/Analysis/LoopAccessAnalysis.h	5 lines
lib/Analysis/LoopAccessAnalysis.cpp	27 lines
lib/Transforms/Vectorize/SLPVectorizer.cpp	172 lines
test/Transforms/SLPVectorizer/X86/jumbled-load.ll	25 lines
test/Transforms/SLPVectorizer/X86/reduction_loads.ll	4 lines

Diff 83923

include/llvm/Analysis/LoopAccessAnalysis.h

 [...]
 /// If necessary this method will version the stride of the pointer according
 /// to \p PtrToStride and therefore add further predicates to \p PSE.
 /// The \p Assume parameter indicates if we are allowed to make additional
 /// run-time assumptions.
 int64_t getPtrStride(PredicatedScalarEvolution &PSE, Value *Ptr, const Loop *Lp,
                      const ValueToValueMap &StridesMap = ValueToValueMap(),
                      bool Assume = false, bool ShouldCheckWrap = true);
 
+/// \brief Saves the sorted memory accesses in vector argument 'Sorted' after
+/// sorting the jumbled memory accesses.
+void sortMemAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
+                     ScalarEvolution &SE, SmallVectorImpl<Value *> &Sorted);
+
 /// \brief Returns true if the memory operations \p A and \p B are consecutive.
 /// This is a simple API that does not depend on the analysis pass.
 bool isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
                          ScalarEvolution &SE, bool CheckType = true);
 
 /// \brief This analysis provides dependence information for the memory accesses
 /// of a loop.
 ///
 [...]

lib/Analysis/LoopAccessAnalysis.cpp

 [...]
 static unsigned getAddressSpaceOperand(Value *I) {
   if (LoadInst *L = dyn_cast<LoadInst>(I))
     return L->getPointerAddressSpace();
   if (StoreInst *S = dyn_cast<StoreInst>(I))
     return S->getPointerAddressSpace();
   return -1;
 }
 
+/// Saves the memory accesses after sorting it into vector argument 'Sorted'.
+void llvm::sortMemAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
+                           ScalarEvolution &SE,
+                           SmallVectorImpl<Value *> &Sorted) {
+  SmallVector<std::pair<int, Value *>, 4> OffValPair;
+  for (auto *Val : VL) {
+    // Compute the constant offset from the base pointer of each memory accesses
+    // and insert into the vector of key,value pair which needs to be sorted.
+    Value *Ptr = getPointerOperand(Val);
+    unsigned AS = getAddressSpaceOperand(Val);
+    unsigned PtrBitWidth = DL.getPointerSizeInBits(AS);
+    Type *Ty = cast<PointerType>(Ptr->getType())->getElementType();
+    APInt Size(PtrBitWidth, DL.getTypeStoreSize(Ty));
+    APInt Offset(PtrBitWidth, 0);
+    Ptr->stripAndAccumulateInBoundsConstantOffsets(DL, Offset);
+    OffValPair.push_back(std::make_pair(Offset.getSExtValue(), Val));
+    std::sort(OffValPair.begin(), OffValPair.end(),
+              [](const std::pair<int, Value *> &left,
+                 const std::pair<int, Value *> &right) {
+                return left.first < right.first;
+              });
+  }
+
+  for (auto& it : OffValPair)
+    Sorted.push_back(it.second);
+}
+
 /// Returns true if the memory operations \p A and \p B are consecutive.
 bool llvm::isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
                                ScalarEvolution &SE, bool CheckType) {
   Value *PtrA = getPointerOperand(A);
   Value *PtrB = getPointerOperand(B);
   unsigned ASA = getAddressSpaceOperand(A);
   unsigned ASB = getAddressSpaceOperand(B);
 
 [...]

lib/Transforms/Vectorize/SLPVectorizer.cpp

Show First 20 Lines • Show All 404 Lines • ▼ Show 20 Lines	private:

/// This is the recursive part of buildTree.		/// This is the recursive part of buildTree.
void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth);		void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth);

/// \returns True if the ExtractElement/ExtractValue instructions in VL can		/// \returns True if the ExtractElement/ExtractValue instructions in VL can
/// be vectorized to use the original vector (or aggregate "bitcast" to a vector).		/// be vectorized to use the original vector (or aggregate "bitcast" to a vector).
bool canReuseExtract(ArrayRef<Value *> VL, unsigned Opcode) const;		bool canReuseExtract(ArrayRef<Value *> VL, unsigned Opcode) const;

/// Vectorize a single entry in the tree.		/// Vectorize a single entry in the tree. VL contains all isomorphic scalars
Value *vectorizeTree(TreeEntry *E);		/// in order of its usage in a user program, for example ADD1, ADD2 and so on
		/// or LOAD1, LOAD2 etc.
		Value *vectorizeTree(ArrayRef<Value *> VL, TreeEntry *E);
		mkuper: Please add an explanation of what the VL parameter means here.
		ashahid (author): Sure.
		mssimpso: Can we be a bit more explicit about VL? Are VL the scalar roots of the vectorizable tree?
		ashahid (author): No, VL is not the scalar roots of the vectorizable tree. VL is all the isomorphic scalars, for example ADD1, ADD2 and so on, or LOAD1, LOAD2, etc.

/// Vectorize a single entry in the tree, starting in \p VL.		/// Vectorize a single entry in the tree, starting in \p VL.
Value *vectorizeTree(ArrayRef<Value *> VL);		Value *vectorizeTree(ArrayRef<Value *> VL);

/// \returns the pointer to the vectorized value if \p VL is already		/// \returns the pointer to the vectorized value if \p VL is already
/// vectorized, or NULL. They may happen in cycles.		/// vectorized, or NULL. They may happen in cycles.
Value *alreadyVectorized(ArrayRef<Value *> VL) const;		Value *alreadyVectorized(ArrayRef<Value *> VL) const;

[24 lines not shown]	void reorderAltShuffleOperands(ArrayRef<Value *> VL,
SmallVectorImpl<Value *> &Right);		SmallVectorImpl<Value *> &Right);
/// \reorder commutative operands to get better probability of		/// \reorder commutative operands to get better probability of
/// generating vectorized code.		/// generating vectorized code.
void reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,		void reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,
SmallVectorImpl<Value *> &Left,		SmallVectorImpl<Value *> &Left,
SmallVectorImpl<Value *> &Right);		SmallVectorImpl<Value *> &Right);
struct TreeEntry {		struct TreeEntry {
TreeEntry() : Scalars(), VectorizedValue(nullptr),		TreeEntry() : Scalars(), VectorizedValue(nullptr),
NeedToGather(0) {}		NeedToGather(0), NeedToShuffle(0) {}

/// \returns true if the scalars in VL are equal to this entry.		/// \returns true if the scalars in VL are equal to this entry.
bool isSame(ArrayRef<Value *> VL) const {		bool isSame(ArrayRef<Value *> VL) const {
assert(VL.size() == Scalars.size() && "Invalid size");		assert(VL.size() == Scalars.size() && "Invalid size");
return std::equal(VL.begin(), VL.end(), Scalars.begin());		return std::equal(VL.begin(), VL.end(), Scalars.begin());
}		}

		/// \returns true if the scalars in VL are found in this tree entry.
		bool isFoundJumbled(ArrayRef<Value *> VL, const DataLayout &DL,
		mkuper: I think you may want a different name - this doesn't actually check whether the scalars are jumbled, it checks whether they're all present. That is, it'll return true even if they're all in-order.
		ashahid (author): What about isFoundJumbled()?
		ScalarEvolution &SE) const {
		assert(VL.size() == Scalars.size() && "Invalid size");
		RKSimon: This might be simplified with a for range loop and use of llvm::none_of / any_of?
		mkuper: Would it make sense to pre-sort both arrays, and then check the two sorted arrays are equal? This would make it O(nlogn) instead of O(n^2). (I'm not sure what to sort based on, though - nor about the actual gain, since I guess VL.size() is small in practice.)
		ashahid (author): Pre-sorting would require two calls to sort and then a compare. IMO, for the given small VL.size() it would not make much difference. However, I am open to other views.
		mkuper: To be honest, I'm not sure - so I'd appreciate another opinion about pre-sorting. Matt/Hal/Simon?
		mssimpso: I think we currently limit VL.size() to a maximum of 16? If so, the gain may not be that much, but I wouldn't expect presorting to be any worse.
		mkuper: I'm still not sure we want this to be quadratic. I'd suggest one of two things: 1) Change this to presort. For VL.size() == 4, it may be slower, but for VL.size() == 16, I'd expect it to be faster. 2) If there's evidence that presorting is actually bad for small sizes, add a FIXME and bail out for VL.size() > 16. I'd prefer for us to fail to vectorize at larger VLs, than silently introduce a quadratic algorithm for larger Ns.
		SmallVector<Value *, 8> list;
		sortMemAccesses(VL, DL, SE, list);
		return isSame(list);
		mkuper: Can you just use std::equal directly? The only thing isSame() does, aside from that, is assert on the sizes - and you have that assert just a few lines above.
		}
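
To make the quadratic-versus-presort discussion above concrete, here is a small sketch of the two ways to check that two bundles hold the same scalars regardless of order. Plain `int` keys stand in for scalars, and both function names are hypothetical; the sketch also assumes each bundle holds every scalar at most once. With duplicates the two checks can disagree.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Quadratic check: for each element of VL, scan Scalars linearly.
bool sameScalarsQuadratic(const std::vector<int> &VL,
                          const std::vector<int> &Scalars) {
  assert(VL.size() == Scalars.size() && "Invalid size");
  return std::all_of(VL.begin(), VL.end(), [&](int V) {
    return std::find(Scalars.begin(), Scalars.end(), V) != Scalars.end();
  });
}

// Presorted check: sort copies of both bundles, then compare element-wise
// with std::equal. Two O(n log n) sorts plus one linear comparison.
bool sameScalarsPresorted(std::vector<int> VL, std::vector<int> Scalars) {
  assert(VL.size() == Scalars.size() && "Invalid size");
  std::sort(VL.begin(), VL.end());
  std::sort(Scalars.begin(), Scalars.end());
  return std::equal(VL.begin(), VL.end(), Scalars.begin());
}
```

For bundles of 4 to 16 elements the difference is likely small either way, which matches the uncertainty voiced in the thread; the presorted form mainly guards against quadratic cost if the bundle size limit is ever raised.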

/// A vector of scalars.		/// A vector of scalars.
ValueList Scalars;		ValueList Scalars;

/// The Scalars are vectorized into this value. It is initialized to Null.		/// The Scalars are vectorized into this value. It is initialized to Null.
Value *VectorizedValue;		Value *VectorizedValue;

/// Do we need to gather this sequence ?		/// Do we need to gather this sequence ?
bool NeedToGather;		bool NeedToGather;

		/// Do we need to shuffle the load ?
		bool NeedToShuffle;
};		};

/// Create a new VectorizableTree entry.		/// Create a new VectorizableTree entry.
TreeEntry *newTreeEntry(ArrayRef<Value *> VL, bool Vectorized) {		TreeEntry *newTreeEntry(ArrayRef<Value *> VL, bool Vectorized,
		bool NeedToShuffle) {
VectorizableTree.emplace_back();		VectorizableTree.emplace_back();
int idx = VectorizableTree.size() - 1;		int idx = VectorizableTree.size() - 1;
TreeEntry *Last = &VectorizableTree[idx];		TreeEntry *Last = &VectorizableTree[idx];
Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());		Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());
Last->NeedToGather = !Vectorized;		Last->NeedToGather = !Vectorized;
		Last->NeedToShuffle = NeedToShuffle;
if (Vectorized) {		if (Vectorized) {
for (int i = 0, e = VL.size(); i != e; ++i) {		for (int i = 0, e = VL.size(); i != e; ++i) {
assert(!ScalarToTreeEntry.count(VL[i]) && "Scalar already in tree!");		assert(!ScalarToTreeEntry.count(VL[i]) && "Scalar already in tree!");
ScalarToTreeEntry[VL[i]] = idx;		ScalarToTreeEntry[VL[i]] = idx;
}		}
} else {		} else {
MustGather.insert(VL.begin(), VL.end());		MustGather.insert(VL.begin(), VL.end());
}		}
[490 lines not shown]


void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth) {		void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth) {
bool isAltShuffle = false;		bool isAltShuffle = false;
assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");		assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");

if (Depth == RecursionMaxDepth) {		if (Depth == RecursionMaxDepth) {
DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");		DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}

// Don't handle vectors.		// Don't handle vectors.
if (VL[0]->getType()->isVectorTy()) {		if (VL[0]->getType()->isVectorTy()) {
DEBUG(dbgs() << "SLP: Gathering due to vector type.\n");		DEBUG(dbgs() << "SLP: Gathering due to vector type.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}

if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))		if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))
if (SI->getValueOperand()->getType()->isVectorTy()) {		if (SI->getValueOperand()->getType()->isVectorTy()) {
DEBUG(dbgs() << "SLP: Gathering due to store vector type.\n");		DEBUG(dbgs() << "SLP: Gathering due to store vector type.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
unsigned Opcode = getSameOpcode(VL);		unsigned Opcode = getSameOpcode(VL);

// Check that this shuffle vector refers to the alternate		// Check that this shuffle vector refers to the alternate
// sequence of opcodes.		// sequence of opcodes.
if (Opcode == Instruction::ShuffleVector) {		if (Opcode == Instruction::ShuffleVector) {
Instruction *I0 = dyn_cast<Instruction>(VL[0]);		Instruction *I0 = dyn_cast<Instruction>(VL[0]);
unsigned Op = I0->getOpcode();		unsigned Op = I0->getOpcode();
if (Op != Instruction::ShuffleVector)		if (Op != Instruction::ShuffleVector)
isAltShuffle = true;		isAltShuffle = true;
}		}

// If all of the operands are identical or constant we have a simple solution.		// If all of the operands are identical or constant we have a simple solution.
if (allConstant(VL) || isSplat(VL) || !allSameBlock(VL) || !Opcode) {		if (allConstant(VL) || isSplat(VL) || !allSameBlock(VL) || !Opcode) {
DEBUG(dbgs() << "SLP: Gathering due to C,S,B,O. \n");		DEBUG(dbgs() << "SLP: Gathering due to C,S,B,O. \n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}

// We now know that this is a vector of instructions of the same type from		// We now know that this is a vector of instructions of the same type from
// the same block.		// the same block.

// Don't vectorize ephemeral values.		// Don't vectorize ephemeral values.
for (unsigned i = 0, e = VL.size(); i != e; ++i) {		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
if (EphValues.count(VL[i])) {		if (EphValues.count(VL[i])) {
DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<		DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<
") is ephemeral.\n");		") is ephemeral.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
}		}

// Check if this is a duplicate of another entry.		// Check if this is a duplicate of another entry.
if (ScalarToTreeEntry.count(VL[0])) {		if (ScalarToTreeEntry.count(VL[0])) {
int Idx = ScalarToTreeEntry[VL[0]];		int Idx = ScalarToTreeEntry[VL[0]];
TreeEntry *E = &VectorizableTree[Idx];		TreeEntry *E = &VectorizableTree[Idx];
for (unsigned i = 0, e = VL.size(); i != e; ++i) {		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
DEBUG(dbgs() << "SLP: \tChecking bundle: " << *VL[i] << ".\n");		DEBUG(dbgs() << "SLP: \tChecking bundle: " << *VL[i] << ".\n");
if (E->Scalars[i] != VL[i]) {		if (E->Scalars[i] != VL[i]) {
DEBUG(dbgs() << "SLP: Gathering due to partial overlap.\n");		DEBUG(dbgs() << "SLP: Gathering due to partial overlap.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
}		}
DEBUG(dbgs() << "SLP: Perfect diamond merge at " << *VL[0] << ".\n");		DEBUG(dbgs() << "SLP: Perfect diamond merge at " << *VL[0] << ".\n");
return;		return;
}		}

// Check that none of the instructions in the bundle are already in the tree.		// Check that none of the instructions in the bundle are already in the tree.
for (unsigned i = 0, e = VL.size(); i != e; ++i) {		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
if (ScalarToTreeEntry.count(VL[i])) {		if (ScalarToTreeEntry.count(VL[i])) {
DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<		DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<
") is already in tree.\n");		") is already in tree.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
}		}

// If any of the scalars is marked as a value that needs to stay scalar then		// If any of the scalars is marked as a value that needs to stay scalar then
// we need to gather the scalars.		// we need to gather the scalars.
for (unsigned i = 0, e = VL.size(); i != e; ++i) {		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
if (MustGather.count(VL[i])) {		if (MustGather.count(VL[i])) {
DEBUG(dbgs() << "SLP: Gathering due to gathered scalar.\n");		DEBUG(dbgs() << "SLP: Gathering due to gathered scalar.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
}		}

// Check that all of the users of the scalars that we want to vectorize are		// Check that all of the users of the scalars that we want to vectorize are
// schedulable.		// schedulable.
Instruction *VL0 = cast<Instruction>(VL[0]);		Instruction *VL0 = cast<Instruction>(VL[0]);
BasicBlock *BB = cast<Instruction>(VL0)->getParent();		BasicBlock *BB = cast<Instruction>(VL0)->getParent();

if (!DT->isReachableFromEntry(BB)) {		if (!DT->isReachableFromEntry(BB)) {
// Don't go into unreachable blocks. They may contain instructions with		// Don't go into unreachable blocks. They may contain instructions with
// dependency cycles which confuse the final scheduling.		// dependency cycles which confuse the final scheduling.
DEBUG(dbgs() << "SLP: bundle in unreachable block.\n");		DEBUG(dbgs() << "SLP: bundle in unreachable block.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}

// Check that every instructions appears once in this bundle.		// Check that every instructions appears once in this bundle.
for (unsigned i = 0, e = VL.size(); i < e; ++i)		for (unsigned i = 0, e = VL.size(); i < e; ++i)
for (unsigned j = i+1; j < e; ++j)		for (unsigned j = i+1; j < e; ++j)
if (VL[i] == VL[j]) {		if (VL[i] == VL[j]) {
DEBUG(dbgs() << "SLP: Scalar used twice in bundle.\n");		DEBUG(dbgs() << "SLP: Scalar used twice in bundle.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}

auto &BSRef = BlocksSchedules[BB];		auto &BSRef = BlocksSchedules[BB];
if (!BSRef) {		if (!BSRef) {
BSRef = llvm::make_unique<BlockScheduling>(BB);		BSRef = llvm::make_unique<BlockScheduling>(BB);
}		}
BlockScheduling &BS = *BSRef.get();		BlockScheduling &BS = *BSRef.get();

if (!BS.tryScheduleBundle(VL, this)) {		if (!BS.tryScheduleBundle(VL, this)) {
DEBUG(dbgs() << "SLP: We are not able to schedule this bundle!\n");		DEBUG(dbgs() << "SLP: We are not able to schedule this bundle!\n");
assert((!BS.getScheduleData(VL[0]) ||		assert((!BS.getScheduleData(VL[0]) ||
!BS.getScheduleData(VL[0])->isPartOfBundle()) &&		!BS.getScheduleData(VL[0])->isPartOfBundle()) &&
"tryScheduleBundle should cancelScheduling on failure");		"tryScheduleBundle should cancelScheduling on failure");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
DEBUG(dbgs() << "SLP: We are able to schedule this bundle.\n");		DEBUG(dbgs() << "SLP: We are able to schedule this bundle.\n");

switch (Opcode) {		switch (Opcode) {
case Instruction::PHI: {		case Instruction::PHI: {
PHINode *PH = dyn_cast<PHINode>(VL0);		PHINode *PH = dyn_cast<PHINode>(VL0);

// Check for terminator values (e.g. invoke).		// Check for terminator values (e.g. invoke).
for (unsigned j = 0; j < VL.size(); ++j)		for (unsigned j = 0; j < VL.size(); ++j)
for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {		for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {
TerminatorInst *Term = dyn_cast<TerminatorInst>(		TerminatorInst *Term = dyn_cast<TerminatorInst>(
cast<PHINode>(VL[j])->getIncomingValueForBlock(PH->getIncomingBlock(i)));		cast<PHINode>(VL[j])->getIncomingValueForBlock(PH->getIncomingBlock(i)));
if (Term) {		if (Term) {
DEBUG(dbgs() << "SLP: Need to swizzle PHINodes (TerminatorInst use).\n");		DEBUG(dbgs() << "SLP: Need to swizzle PHINodes (TerminatorInst use).\n");
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
}		}

newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
DEBUG(dbgs() << "SLP: added a vector of PHINodes.\n");		DEBUG(dbgs() << "SLP: added a vector of PHINodes.\n");

for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {		for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<PHINode>(j)->getIncomingValueForBlock(		Operands.push_back(cast<PHINode>(j)->getIncomingValueForBlock(
PH->getIncomingBlock(i)));		PH->getIncomingBlock(i)));

buildTree_rec(Operands, Depth + 1);		buildTree_rec(Operands, Depth + 1);
}		}
return;		return;
}		}
case Instruction::ExtractValue:		case Instruction::ExtractValue:
case Instruction::ExtractElement: {		case Instruction::ExtractElement: {
bool Reuse = canReuseExtract(VL, Opcode);		bool Reuse = canReuseExtract(VL, Opcode);
if (Reuse) {		if (Reuse) {
DEBUG(dbgs() << "SLP: Reusing extract sequence.\n");		DEBUG(dbgs() << "SLP: Reusing extract sequence.\n");
} else {		} else {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
}		}
newTreeEntry(VL, Reuse);		newTreeEntry(VL, Reuse, false);
return;		return;
}		}
case Instruction::Load: {		case Instruction::Load: {
// Check that a vectorized load would load the same memory as a scalar		// Check that a vectorized load would load the same memory as a scalar
// load.		// load.
// For example we don't want vectorize loads that are smaller than 8 bit.		// For example we don't want vectorize loads that are smaller than 8 bit.
// Even though we have a packed struct {<i2, i2, i2, i2>} LLVM treats		// Even though we have a packed struct {<i2, i2, i2, i2>} LLVM treats
// loading/storing it as an i8 struct. If we vectorize loads/stores from		// loading/storing it as an i8 struct. If we vectorize loads/stores from
// such a struct we read/write packed bits disagreeing with the		// such a struct we read/write packed bits disagreeing with the
// unvectorized version.		// unvectorized version.
Type *ScalarTy = VL[0]->getType();		Type *ScalarTy = VL[0]->getType();

if (DL->getTypeSizeInBits(ScalarTy) !=		if (DL->getTypeSizeInBits(ScalarTy) !=
DL->getTypeAllocSizeInBits(ScalarTy)) {		DL->getTypeAllocSizeInBits(ScalarTy)) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");		DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");
return;		return;
}		}

// Make sure all loads in the bundle are simple - we can't vectorize		// Make sure all loads in the bundle are simple - we can't vectorize
// atomic or volatile loads.		// atomic or volatile loads.
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {		for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {
LoadInst *L = cast<LoadInst>(VL[i]);		LoadInst *L = cast<LoadInst>(VL[i]);
if (!L->isSimple()) {		if (!L->isSimple()) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");		DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");
return;		return;
}		}
}		}

// Check if the loads are consecutive, reversed, or neither.		// Check if the loads are consecutive, reversed, or neither.
// TODO: What we really want is to sort the loads, but for now, check
// the two likely directions.
bool Consecutive = true;		bool Consecutive = true;
		mkuper: This TODO gets done.
		ashahid (author): Yes, that's right.
bool ReverseConsecutive = true;		bool ReverseConsecutive = true;
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {		for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {
if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {
Consecutive = false;		Consecutive = false;
break;		break;
} else {		} else {
ReverseConsecutive = false;		ReverseConsecutive = false;
}		}
}		}

if (Consecutive) {		if (Consecutive) {
++NumLoadsWantToKeepOrder;		++NumLoadsWantToKeepOrder;
newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
DEBUG(dbgs() << "SLP: added a vector of loads.\n");		DEBUG(dbgs() << "SLP: added a vector of loads.\n");
return;		return;
}		}

// If none of the load pairs were consecutive when checked in order,		// If none of the load pairs were consecutive when checked in order,
		mkuper: Do we still need the ReverseConsecutive case at all? That was introduced in r276477 as a patch for the common case of PR28474, where we find the loads in reverse order. But this should completely supersede it, right?
		mkuper: Why not for VL.size() == 2?
		ashahid (author): A jumbled VL of VL.size() == 2 is essentially a case of reversed VL. Considering the tradeoff between compile time of extra buildTree() for VL.size() == 2 vs additional runtime for shufflevector, I opted for extra compile time over extra runtime.
		mkuper: The trade-off here is more one of code complexity - is the gain in compile time worth having all the additional logic present for both the "fully unsorted" case and the "reversed" case?
		mssimpso: Am I wrong in thinking we don't necessarily know if rebuilding the tree with reversed loads would be any better than having the shuffle? Previously we were going to bail, but now we have an option.
		mkuper: I don't think you're wrong, on the contrary - I was advocating removing the code I added (for reversing loads) and completely replacing it with something like this. But I didn't realize that'll introduce an extra shuffle.
// check the reverse order.		// check the reverse order.
if (ReverseConsecutive)		if (ReverseConsecutive)
		mkuper: This looks rather weird. Can you make it more idiomatic?
		ashahid (author): Sure.
for (unsigned i = VL.size() - 1; i > 0; --i)		for (unsigned i = VL.size() - 1; i > 0; --i)
if (!isConsecutiveAccess(VL[i], VL[i - 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i - 1], DL, SE)) {
ReverseConsecutive = false;		ReverseConsecutive = false;
break;		break;
}		}

		if (VL.size() > 2 && !ReverseConsecutive) {
		bool ShuffledLoads = true;
		SmallVector<Value *, 8> list;
		RKSimon: Is it worth breaking here once we know that shuffledLoad is false? Remove the braces if you can.
		ashahid (author): Seems yes.
		sortMemAccesses(VL, DL, SE, list);
		auto newVL = makeArrayRef(list.begin(), list.end());
		mkuper: newVL -> NewVL
		for (unsigned i = 0, e = newVL.size() - 1; i < e; ++i)
		mkuper: Could you please add braces to this for? It's a one-statement body, but not a one-line, so I think braces would be better.
		if (!isConsecutiveAccess(newVL[i], newVL[i + 1], DL, SE)) {
		ShuffledLoads = false;
		break;
		}
		if (ShuffledLoads) {
		newTreeEntry(newVL, true, true);
		return;
		}
		}
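
The jumbled-load path above sorts the bundle, checks that the sorted loads are consecutive, and marks the entry for a later shufflevector. A minimal sketch of that idea follows, assuming each load is represented only by its element index relative to a common base. The `jumbledLoadMask` name and the integer-offset input are hypothetical and do not reflect how the patch derives offsets or emits the mask.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Offsets[i] is the element index (relative to a common base) read by the
// i-th scalar load in program order. On success, Mask[i] is the lane of the
// sorted, consecutive vector load that holds the i-th scalar, i.e. the mask a
// shuffle would need to restore program order. Returns false, with Mask left
// empty, when the sorted offsets are not consecutive.
bool jumbledLoadMask(const std::vector<int> &Offsets,
                     std::vector<unsigned> &Mask) {
  Mask.clear();
  const unsigned N = static_cast<unsigned>(Offsets.size());

  // Order[Pos] is the program-order index of the load at sorted position Pos.
  std::vector<unsigned> Order(N);
  std::iota(Order.begin(), Order.end(), 0u);
  std::sort(Order.begin(), Order.end(), [&](unsigned A, unsigned B) {
    return Offsets[A] < Offsets[B];
  });

  // After sorting, each load must sit exactly one element after the previous.
  for (unsigned I = 0; I + 1 < N; ++I)
    if (Offsets[Order[I + 1]] != Offsets[Order[I]] + 1)
      return false;

  // Invert the permutation to obtain the lane for each original scalar.
  Mask.resize(N);
  for (unsigned Pos = 0; Pos < N; ++Pos)
    Mask[Order[Pos]] = Pos;
  return true;
}
```

For a two-element bundle the only jumbled order is the reversed one, which is why the thread treats VL.size() == 2 as the reversed-load case rather than a separate shuffle case.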

BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);

if (ReverseConsecutive) {		if (ReverseConsecutive) {
++NumLoadsWantToChangeOrder;		++NumLoadsWantToChangeOrder;
DEBUG(dbgs() << "SLP: Gathering reversed loads.\n");		DEBUG(dbgs() << "SLP: Gathering reversed loads.\n");
} else {		} else {
DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");		DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");
}		}
return;		return;
}		}
case Instruction::ZExt:		case Instruction::ZExt:
case Instruction::SExt:		case Instruction::SExt:
case Instruction::FPToUI:		case Instruction::FPToUI:
case Instruction::FPToSI:		case Instruction::FPToSI:
case Instruction::FPExt:		case Instruction::FPExt:
case Instruction::PtrToInt:		case Instruction::PtrToInt:
case Instruction::IntToPtr:		case Instruction::IntToPtr:
case Instruction::SIToFP:		case Instruction::SIToFP:
case Instruction::UIToFP:		case Instruction::UIToFP:
case Instruction::Trunc:		case Instruction::Trunc:
case Instruction::FPTrunc:		case Instruction::FPTrunc:
case Instruction::BitCast: {		case Instruction::BitCast: {
Type *SrcTy = VL0->getOperand(0)->getType();		Type *SrcTy = VL0->getOperand(0)->getType();
for (unsigned i = 0; i < VL.size(); ++i) {		for (Value *Val : VL) {
		RKSimon: for (unsigned i = 0, e = VL.size(); i < e; ++i) {
		ashahid (author): Do you want me to change the style of the for statement to the above one?
		RKSimon: Yes please - since you're touching this code, it might as well be dealt with. It can be done as an NFC pre-commit if you prefer to keep this patch cleaner.
Type *Ty = cast<Instruction>(VL[i])->getOperand(0)->getType();		Type *Ty = cast<Instruction>(Val)->getOperand(0)->getType();
if (Ty != SrcTy || !isValidElementType(Ty)) {		if (Ty != SrcTy || !isValidElementType(Ty)) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: Gathering casts with different src types.\n");		DEBUG(dbgs() << "SLP: Gathering casts with different src types.\n");
return;		return;
}		}
}		}
newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
DEBUG(dbgs() << "SLP: added a vector of casts.\n");		DEBUG(dbgs() << "SLP: added a vector of casts.\n");

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth+1);		buildTree_rec(Operands, Depth+1);
}		}
return;		return;
}		}
case Instruction::ICmp:		case Instruction::ICmp:
case Instruction::FCmp: {		case Instruction::FCmp: {
// Check that all of the compares have the same predicate.		// Check that all of the compares have the same predicate.
CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();		CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();
Type *ComparedTy = cast<Instruction>(VL[0])->getOperand(0)->getType();		Type *ComparedTy = cast<Instruction>(VL[0])->getOperand(0)->getType();
for (unsigned i = 1, e = VL.size(); i < e; ++i) {		for (unsigned i = 1, e = VL.size(); i < e; ++i) {
CmpInst *Cmp = cast<CmpInst>(VL[i]);		CmpInst *Cmp = cast<CmpInst>(VL[i]);
if (Cmp->getPredicate() != P0 ||		if (Cmp->getPredicate() != P0 ||
Cmp->getOperand(0)->getType() != ComparedTy) {		Cmp->getOperand(0)->getType() != ComparedTy) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: Gathering cmp with different predicate.\n");		DEBUG(dbgs() << "SLP: Gathering cmp with different predicate.\n");
return;		return;
}		}
}		}

newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
DEBUG(dbgs() << "SLP: added a vector of compares.\n");		DEBUG(dbgs() << "SLP: added a vector of compares.\n");

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

[15 lines not shown]	switch (Opcode) {
case Instruction::SRem:		case Instruction::SRem:
case Instruction::FRem:		case Instruction::FRem:
case Instruction::Shl:		case Instruction::Shl:
case Instruction::LShr:		case Instruction::LShr:
case Instruction::AShr:		case Instruction::AShr:
case Instruction::And:		case Instruction::And:
case Instruction::Or:		case Instruction::Or:
case Instruction::Xor: {		case Instruction::Xor: {
newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
DEBUG(dbgs() << "SLP: added a vector of bin op.\n");		DEBUG(dbgs() << "SLP: added a vector of bin op.\n");

// Sort operands of the instructions so that each side is more likely to		// Sort operands of the instructions so that each side is more likely to
// have the same opcode.		// have the same opcode.
if (isa<BinaryOperator>(VL0) && VL0->isCommutative()) {		if (isa<BinaryOperator>(VL0) && VL0->isCommutative()) {
ValueList Left, Right;		ValueList Left, Right;
reorderInputsAccordingToOpcode(VL, Left, Right);		reorderInputsAccordingToOpcode(VL, Left, Right);
buildTree_rec(Left, Depth + 1);		buildTree_rec(Left, Depth + 1);
buildTree_rec(Right, Depth + 1);		buildTree_rec(Right, Depth + 1);
return;		return;
}		}

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth+1);		buildTree_rec(Operands, Depth+1);
}		}
return;		return;
}		}
case Instruction::GetElementPtr: {		case Instruction::GetElementPtr: {
// We don't combine GEPs with complicated (nested) indexing.		// We don't combine GEPs with complicated (nested) indexing.
for (unsigned j = 0; j < VL.size(); ++j) {		for (Value *Val : VL) {
		RKSimon: for (unsigned j = 0, e = VL.size(); j < e; ++j) {
if (cast<Instruction>(VL[j])->getNumOperands() != 2) {		if (cast<Instruction>(Val)->getNumOperands() != 2) {
DEBUG(dbgs() << "SLP: not-vectorizable GEP (nested indexes).\n");		DEBUG(dbgs() << "SLP: not-vectorizable GEP (nested indexes).\n");
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
}		}

// We can't combine several GEPs into one vector if they operate on		// We can't combine several GEPs into one vector if they operate on
// different types.		// different types.
Type *Ty0 = cast<Instruction>(VL0)->getOperand(0)->getType();		Type *Ty0 = cast<Instruction>(VL0)->getOperand(0)->getType();
for (unsigned j = 0; j < VL.size(); ++j) {		for (Value *Val : VL) {
		RKSimon: for (unsigned j = 0, e = VL.size(); j < e; ++j) {
Type *CurTy = cast<Instruction>(VL[j])->getOperand(0)->getType();		Type *CurTy = cast<Instruction>(Val)->getOperand(0)->getType();
if (Ty0 != CurTy) {		if (Ty0 != CurTy) {
DEBUG(dbgs() << "SLP: not-vectorizable GEP (different types).\n");		DEBUG(dbgs() << "SLP: not-vectorizable GEP (different types).\n");
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
}		}

// We don't combine GEPs with non-constant indexes.		// We don't combine GEPs with non-constant indexes.
for (unsigned j = 0; j < VL.size(); ++j) {		for (Value *Val : VL) {
		RKSimon: for (unsigned j = 0, e = VL.size(); j < e; ++j) {
auto Op = cast<Instruction>(VL[j])->getOperand(1);		auto Op = cast<Instruction>(Val)->getOperand(1);
if (!isa<ConstantInt>(Op)) {		if (!isa<ConstantInt>(Op)) {
DEBUG(		DEBUG(
dbgs() << "SLP: not-vectorizable GEP (non-constant indexes).\n");		dbgs() << "SLP: not-vectorizable GEP (non-constant indexes).\n");
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
}		}

newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
DEBUG(dbgs() << "SLP: added a vector of GEPs.\n");		DEBUG(dbgs() << "SLP: added a vector of GEPs.\n");
for (unsigned i = 0, e = 2; i < e; ++i) {		for (unsigned i = 0, e = 2; i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth + 1);		buildTree_rec(Operands, Depth + 1);
}		}
return;		return;
}		}
case Instruction::Store: {		case Instruction::Store: {
// Check if the stores are consecutive or of we need to swizzle them.		// Check if the stores are consecutive or of we need to swizzle them.
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i)		for (unsigned i = 0, e = VL.size() - 1; i < e; ++i)
if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: Non-consecutive store.\n");		DEBUG(dbgs() << "SLP: Non-consecutive store.\n");
return;		return;
}		}

newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
DEBUG(dbgs() << "SLP: added a vector of stores.\n");		DEBUG(dbgs() << "SLP: added a vector of stores.\n");

ValueList Operands;		ValueList Operands;
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(0));		Operands.push_back(cast<Instruction>(j)->getOperand(0));

buildTree_rec(Operands, Depth + 1);		buildTree_rec(Operands, Depth + 1);
return;		return;
}		}
case Instruction::Call: {		case Instruction::Call: {
// Check if the calls are all to the same vectorizable intrinsic.		// Check if the calls are all to the same vectorizable intrinsic.
CallInst *CI = cast<CallInst>(VL[0]);		CallInst *CI = cast<CallInst>(VL[0]);
// Check if this is an Intrinsic call or something that can be		// Check if this is an Intrinsic call or something that can be
// represented by an intrinsic call		// represented by an intrinsic call
Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);		Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);
if (!isTriviallyVectorizable(ID)) {		if (!isTriviallyVectorizable(ID)) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: Non-vectorizable call.\n");		DEBUG(dbgs() << "SLP: Non-vectorizable call.\n");
return;		return;
}		}
Function *Int = CI->getCalledFunction();		Function *Int = CI->getCalledFunction();
Value *A1I = nullptr;		Value *A1I = nullptr;
if (hasVectorInstrinsicScalarOpd(ID, 1))		if (hasVectorInstrinsicScalarOpd(ID, 1))
A1I = CI->getArgOperand(1);		A1I = CI->getArgOperand(1);
for (unsigned i = 1, e = VL.size(); i != e; ++i) {		for (unsigned i = 1, e = VL.size(); i != e; ++i) {
CallInst *CI2 = dyn_cast<CallInst>(VL[i]);		CallInst *CI2 = dyn_cast<CallInst>(VL[i]);
if (!CI2 || CI2->getCalledFunction() != Int ||		if (!CI2 || CI2->getCalledFunction() != Int ||
getVectorIntrinsicIDForCall(CI2, TLI) != ID ||		getVectorIntrinsicIDForCall(CI2, TLI) != ID ||
!CI->hasIdenticalOperandBundleSchema(*CI2)) {		!CI->hasIdenticalOperandBundleSchema(*CI2)) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: mismatched calls:" << CI << "!=" << VL[i]		DEBUG(dbgs() << "SLP: mismatched calls:" << CI << "!=" << VL[i]
<< "\n");		<< "\n");
return;		return;
}		}
// ctlz,cttz and powi are special intrinsics whose second argument		// ctlz,cttz and powi are special intrinsics whose second argument
// should be same in order for them to be vectorized.		// should be same in order for them to be vectorized.
if (hasVectorInstrinsicScalarOpd(ID, 1)) {		if (hasVectorInstrinsicScalarOpd(ID, 1)) {
Value *A1J = CI2->getArgOperand(1);		Value *A1J = CI2->getArgOperand(1);
if (A1I != A1J) {		if (A1I != A1J) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: mismatched arguments in call:" << *CI		DEBUG(dbgs() << "SLP: mismatched arguments in call:" << *CI
<< " argument "<< A1I<<"!=" << A1J		<< " argument "<< A1I<<"!=" << A1J
<< "\n");		<< "\n");
return;		return;
}		}
}		}
// Verify that the bundle operands are identical between the two calls.		// Verify that the bundle operands are identical between the two calls.
if (CI->hasOperandBundles() &&		if (CI->hasOperandBundles() &&
!std::equal(CI->op_begin() + CI->getBundleOperandsStartIndex(),		!std::equal(CI->op_begin() + CI->getBundleOperandsStartIndex(),
CI->op_begin() + CI->getBundleOperandsEndIndex(),		CI->op_begin() + CI->getBundleOperandsEndIndex(),
CI2->op_begin() + CI2->getBundleOperandsStartIndex())) {		CI2->op_begin() + CI2->getBundleOperandsStartIndex())) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: mismatched bundle operands in calls:" << *CI << "!="		DEBUG(dbgs() << "SLP: mismatched bundle operands in calls:" << *CI << "!="
<< *VL[i] << '\n');		<< *VL[i] << '\n');
return;		return;
}		}
}		}

newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
for (unsigned i = 0, e = CI->getNumArgOperands(); i != e; ++i) {		for (unsigned i = 0, e = CI->getNumArgOperands(); i != e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL) {		for (Value *j : VL) {
CallInst *CI2 = dyn_cast<CallInst>(j);		CallInst *CI2 = dyn_cast<CallInst>(j);
Operands.push_back(CI2->getArgOperand(i));		Operands.push_back(CI2->getArgOperand(i));
}		}
buildTree_rec(Operands, Depth + 1);		buildTree_rec(Operands, Depth + 1);
}		}
return;		return;
}		}
case Instruction::ShuffleVector: {		case Instruction::ShuffleVector: {
// If this is not an alternate sequence of opcode like add-sub		// If this is not an alternate sequence of opcode like add-sub
// then do not vectorize this instruction.		// then do not vectorize this instruction.
if (!isAltShuffle) {		if (!isAltShuffle) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: ShuffleVector are not vectorized.\n");		DEBUG(dbgs() << "SLP: ShuffleVector are not vectorized.\n");
return;		return;
}		}
newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
DEBUG(dbgs() << "SLP: added a ShuffleVector op.\n");		DEBUG(dbgs() << "SLP: added a ShuffleVector op.\n");

// Reorder operands if reordering would enable vectorization.		// Reorder operands if reordering would enable vectorization.
if (isa<BinaryOperator>(VL0)) {		if (isa<BinaryOperator>(VL0)) {
ValueList Left, Right;		ValueList Left, Right;
reorderAltShuffleOperands(VL, Left, Right);		reorderAltShuffleOperands(VL, Left, Right);
buildTree_rec(Left, Depth + 1);		buildTree_rec(Left, Depth + 1);
buildTree_rec(Right, Depth + 1);		buildTree_rec(Right, Depth + 1);
return;		return;
}		}

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth + 1);		buildTree_rec(Operands, Depth + 1);
}		}
return;		return;
}		}
default:		default:
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");		DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");
return;		return;
}		}
}		}

unsigned BoUpSLP::canMapToVector(Type *T, const DataLayout &DL) const {		unsigned BoUpSLP::canMapToVector(Type *T, const DataLayout &DL) const {
unsigned N;		unsigned N;
Type *EltTy;		Type *EltTy;
[778 lines not shown]	Value *BoUpSLP::alreadyVectorized(ArrayRef<Value *> VL) const {
}		}
return nullptr;		return nullptr;
}		}

Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL) {		Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL) {
if (ScalarToTreeEntry.count(VL[0])) {		if (ScalarToTreeEntry.count(VL[0])) {
int Idx = ScalarToTreeEntry[VL[0]];		int Idx = ScalarToTreeEntry[VL[0]];
TreeEntry *E = &VectorizableTree[Idx];		TreeEntry *E = &VectorizableTree[Idx];
if (E->isSame(VL))		if (E->isSame(VL) || (E->NeedToShuffle && E->isFoundJumbled(VL, DL, SE)))
return vectorizeTree(E);		return vectorizeTree(VL, E);
}		}

Type *ScalarTy = VL[0]->getType();		Type *ScalarTy = VL[0]->getType();
if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))		if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))
ScalarTy = SI->getValueOperand()->getType();		ScalarTy = SI->getValueOperand()->getType();
VectorType *VecTy = VectorType::get(ScalarTy, VL.size());		VectorType *VecTy = VectorType::get(ScalarTy, VL.size());

return Gather(VL, VecTy);		return Gather(VL, VecTy);
}		}

Value *BoUpSLP::vectorizeTree(TreeEntry *E) {		Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL, TreeEntry *E) {
IRBuilder<>::InsertPointGuard Guard(Builder);		IRBuilder<>::InsertPointGuard Guard(Builder);

if (E->VectorizedValue) {		if (E->VectorizedValue && !E->NeedToShuffle) {
DEBUG(dbgs() << "SLP: Diamond merged for " << *E->Scalars[0] << ".\n");		DEBUG(dbgs() << "SLP: Diamond merged for " << *E->Scalars[0] << ".\n");
return E->VectorizedValue;		return E->VectorizedValue;
}		}

Instruction *VL0 = cast<Instruction>(E->Scalars[0]);		Instruction *VL0 = cast<Instruction>(E->Scalars[0]);
Type *ScalarTy = VL0->getType();		Type *ScalarTy = VL0->getType();
if (StoreInst *SI = dyn_cast<StoreInst>(VL0))		if (StoreInst *SI = dyn_cast<StoreInst>(VL0))
ScalarTy = SI->getValueOperand()->getType();		ScalarTy = SI->getValueOperand()->getType();
[221 lines not shown]	case Instruction::Load: {
unsigned Alignment = LI->getAlignment();		unsigned Alignment = LI->getAlignment();
LI = Builder.CreateLoad(VecPtr);		LI = Builder.CreateLoad(VecPtr);
if (!Alignment) {		if (!Alignment) {
Alignment = DL->getABITypeAlignment(ScalarLoadTy);		Alignment = DL->getABITypeAlignment(ScalarLoadTy);
}		}
LI->setAlignment(Alignment);		LI->setAlignment(Alignment);
E->VectorizedValue = LI;		E->VectorizedValue = LI;
++NumVectorInstructions;		++NumVectorInstructions;
return propagateMetadata(LI, E->Scalars);		propagateMetadata(LI, E->Scalars);

		// As program order of scalar loads are jumbled, the vectorized 'load'
		// must be followed by a 'shuffle' with the required jumbled mask.
		mssimpso: I probably missed this, but why are we checking the sizes? Does this mean there will be cases where E->NeedToShuffle is true but we don't generate the shuffle?
		ashahid (Author): No, I want to ensure that the resulting vector type does not differ because of the length of the vector value.
		mssimpso: I don't think I fully understand this yet. Can you please make the comment more detailed? In particular, when does VL.size() not equal Scalars.size()? Is this the case when a bundle gets split up into smaller chunks? And if this is true, what does it imply for the jumbled accesses? It looks like we will still end up with a vector load, but then when are the elements placed in the right order? Sorry if this should all be obvious!
		ashahid (Author): As such, I don't expect VL.size() to differ from Scalars.size(), but if it does, the compiler may throw an assertion for incorrect vector types. I just wanted to avoid that. Maybe I am presuming it. I will check by removing this specific check.
		mssimpso: Should we make the size check an assertion?
		if (!VL.empty() && (E->NeedToShuffle)) {
		assert(VL.size() == E->Scalars.size() &&
		RKSimon: for (unsigned i = 0, e = VecTy->getNumElements(); i < e; ++i) {
		RKSimon: clang-format?
		mssimpso: Hi Shahid, I'm hitting the assertion here while testing this patch. Can you take a look?
		ashahid (Author): Sure. If possible, can you share the asserting test?
		mssimpso: Sure, I'll try and reduce something for you.
		mssimpso: OK, you should be able to reproduce the assert with the bugpoint reduced test case at P7950. Thanks! Command: opt < D26905.ll -slp-vectorizer -S
		"Equal number of scalars expected");
		SmallVector<Constant *, 8> Mask;
		for (Value *Val : VL) {
		if (ScalarToTreeEntry.count(Val)) {
		RKSimon: for range loop?
		ashahid (Author): Sure
		int Idx = ScalarToTreeEntry[Val];
		TreeEntry *E = &VectorizableTree[Idx];
		for (unsigned Lane = 0, LE = VL.size(); Lane != LE; ++Lane) {
		if (E->Scalars[Lane] == Val) {
		Mask.push_back(Builder.getInt32(Lane));
		break;
		}
		}
		}
		}

		// Generate shuffle for jumbled memory access
		Value *Undef = UndefValue::get(VecTy);
		Value *shuf = Builder.CreateShuffleVector((Value *)LI, Undef,
		mkuper: shuf -> Shuf
		ConstantVector::get(Mask));
		return shuf;
		mssimpso: I also saw verifier failures where TBAA metadata had been applied to the shuffle, like: "TBAA is only for loads, stores and calls!" on the instruction %14 = shufflevector <4 x i32> %13, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>, !tbaa !66
		ashahid (Author): Ok, will fix it.
		mssimpso: I think you should probably just copy the metadata from the scalar load to the vector load, like: propagateMetadata(LI, E->Scalars); return Shuf;
		ashahid (Author): Yes, TBAA metadata is not valid on shufflevector, and the verifier recently added this assert.
		}

		return LI;
}		}
case Instruction::Store: {		case Instruction::Store: {
StoreInst *SI = cast<StoreInst>(VL0);		StoreInst *SI = cast<StoreInst>(VL0);
unsigned Alignment = SI->getAlignment();		unsigned Alignment = SI->getAlignment();
unsigned AS = SI->getPointerAddressSpace();		unsigned AS = SI->getPointerAddressSpace();

ValueList ValueOp;		ValueList ValueOp;
for (Value *V : E->Scalars)		for (Value *V : E->Scalars)
[158 lines not shown]
Value *BoUpSLP::vectorizeTree() {		Value *BoUpSLP::vectorizeTree() {

// All blocks must be scheduled before any instructions are inserted.		// All blocks must be scheduled before any instructions are inserted.
for (auto &BSIter : BlocksSchedules) {		for (auto &BSIter : BlocksSchedules) {
scheduleBlock(BSIter.second.get());		scheduleBlock(BSIter.second.get());
}		}

Builder.SetInsertPoint(&F->getEntryBlock().front());		Builder.SetInsertPoint(&F->getEntryBlock().front());
auto *VectorRoot = vectorizeTree(&VectorizableTree[0]);		auto *VectorRoot = vectorizeTree(ArrayRef<Value *>(), &VectorizableTree[0]);

// If the vectorized tree can be rewritten in a smaller type, we truncate the		// If the vectorized tree can be rewritten in a smaller type, we truncate the
// vectorized root. InstCombine will then rewrite the entire expression. We		// vectorized root. InstCombine will then rewrite the entire expression. We
// sign extend the extracted values below.		// sign extend the extracted values below.
auto *ScalarRoot = VectorizableTree[0].Scalars[0];		auto *ScalarRoot = VectorizableTree[0].Scalars[0];
if (MinBWs.count(ScalarRoot)) {		if (MinBWs.count(ScalarRoot)) {
if (auto *I = dyn_cast<Instruction>(VectorRoot))		if (auto *I = dyn_cast<Instruction>(VectorRoot))
Builder.SetInsertPoint(&*++BasicBlock::iterator(I));		Builder.SetInsertPoint(&*++BasicBlock::iterator(I));
[2,225 lines not shown]
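
The mask loop in the Load case of vectorizeTree(VL, E) above can be tried outside LLVM. What follows is only a minimal standalone sketch, not the patch's code: loads are modeled as plain integers (for example, element offsets), `Sorted` stands in for E->Scalars, `Requested` stands in for VL, and `computeShuffleMask` is a hypothetical helper name introduced for illustration. It also returns false on a size mismatch or a missing scalar, which the patch instead guards with an assertion.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the mask loop in the Load case.
// Sorted    : loads in memory order (the role of E->Scalars).
// Requested : loads in the order the user consumes them (the role of VL).
// Mask      : on success, Mask[i] is the lane of Sorted that holds Requested[i].
bool computeShuffleMask(const std::vector<int> &Sorted,
                        const std::vector<int> &Requested,
                        std::vector<int> &Mask) {
  Mask.clear();
  // The patch asserts on this condition; the sketch just reports failure.
  if (Sorted.size() != Requested.size())
    return false;

  for (int Val : Requested) {
    bool Found = false;
    for (std::size_t Lane = 0; Lane != Sorted.size(); ++Lane) {
      if (Sorted[Lane] == Val) {
        Mask.push_back(static_cast<int>(Lane));
        Found = true;
        break;
      }
    }
    if (!Found)
      return false; // Requested scalar is not part of this bundle.
  }
  return true;
}
```

With loads at offsets 10..13 consumed in the order 11, 13, 12, 10, the sketch yields the mask <1, 3, 2, 0>, the same shape as the first shufflevector mask in the jumbled-load.ll test.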

test/Transforms/SLPVectorizer/X86/jumbled-load.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s			; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s

				RKSimon: Possibly commit this test to trunk with the current output generated by utils/update_test_checks.py?
				ashahid (Author): By "current output" do you mean the output generated by utils/update_test_checks.py with this patch?
				RKSimon: I mean commit the current (pre-patch) codegen so that this patch demonstrates the diff.


	define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {			define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {
	; CHECK-LABEL: @jumbled-load(			; CHECK-LABEL: @jumbled-load(
	; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* %in, i64 0			; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* %in, i64 0
	; CHECK-NEXT: [[LOAD_1:%.*]] = load i32, i32* [[IN_ADDR]], align 4
	; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_2:%.*]] = load i32, i32* [[GEP_1]], align 4
	; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
				mkuper: Please add a test that has several load packets (e.g. multiplies one load sequence by another load sequence).
				ashahid (Author): Sure
	; CHECK-NEXT: [[LOAD_3:%.*]] = load i32, i32* [[GEP_2]], align 4
	; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_4:%.*]] = load i32, i32* [[GEP_3]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
				; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
				; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 2, i32 0>
	; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* %inn, i64 0			; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* %inn, i64 0
	; CHECK-NEXT: [[LOAD_5:%.*]] = load i32, i32* [[INN_ADDR]], align 4
	; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_6:%.*]] = load i32, i32* [[GEP_4]], align 4
	; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_7:%.*]] = load i32, i32* [[GEP_5]], align 4
	; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1
	; CHECK-NEXT: [[LOAD_8:%.*]] = load i32, i32* [[GEP_6]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*
	; CHECK-NEXT: [[MUL_1:%.*]] = mul i32 [[LOAD_3]], [[LOAD_5]]			; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, <4 x i32>* [[TMP4]], align 4
	; CHECK-NEXT: [[MUL_2:%.*]] = mul i32 [[LOAD_2]], [[LOAD_8]]			; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
	; CHECK-NEXT: [[MUL_3:%.*]] = mul i32 [[LOAD_4]], [[LOAD_7]]			; CHECK-NEXT: [[TMP7:%.*]] = mul <4 x i32> [[TMP3]], [[TMP6]]
	; CHECK-NEXT: [[MUL_4:%.*]] = mul i32 [[LOAD_1]], [[LOAD_6]]
	; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* %out, i64 0			; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* %out, i64 0
	; CHECK-NEXT: store i32 [[MUL_1]], i32* [[GEP_7]], align 4
	; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* %out, i64 1			; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* %out, i64 1
	; CHECK-NEXT: store i32 [[MUL_2]], i32* [[GEP_8]], align 4
	; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* %out, i64 2			; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* %out, i64 2
	; CHECK-NEXT: store i32 [[MUL_3]], i32* [[GEP_9]], align 4
	; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* %out, i64 3			; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* %out, i64 3
	; CHECK-NEXT: store i32 [[MUL_4]], i32* [[GEP_10]], align 4			; CHECK-NEXT: [[TMP8:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
				; CHECK-NEXT: store <4 x i32> [[TMP7]], <4 x i32>* [[TMP8]], align 4
	; CHECK-NEXT: ret i32 undef			; CHECK-NEXT: ret i32 undef
	;			;
	%in.addr = getelementptr inbounds i32, i32* %in, i64 0			%in.addr = getelementptr inbounds i32, i32* %in, i64 0
	%load.1 = load i32, i32* %in.addr, align 4			%load.1 = load i32, i32* %in.addr, align 4
	%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3			%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3
	%load.2 = load i32, i32* %gep.1, align 4			%load.2 = load i32, i32* %gep.1, align 4
	%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1			%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1
	%load.3 = load i32, i32* %gep.2, align 4			%load.3 = load i32, i32* %gep.2, align 4
	%gep.3 = getelementptr inbounds i32, i32* %in.addr, i64 2			%gep.3 = getelementptr inbounds i32, i32* %in.addr, i64 2
	%load.4 = load i32, i32* %gep.3, align 4			%load.4 = load i32, i32* %gep.3, align 4
	%inn.addr = getelementptr inbounds i32, i32* %inn, i64 0			%inn.addr = getelementptr inbounds i32, i32* %inn, i64 0
	%load.5 = load i32, i32* %inn.addr, align 4			%load.5 = load i32, i32* %inn.addr, align 4
	%gep.4 = getelementptr inbounds i32, i32* %inn.addr, i64 2			%gep.4 = getelementptr inbounds i32, i32* %inn.addr, i64 2
	%load.6 = load i32, i32* %gep.4, align 4			%load.6 = load i32, i32* %gep.4, align 4
	%gep.5 = getelementptr inbounds i32, i32* %inn.addr, i64 3			%gep.5 = getelementptr inbounds i32, i32* %inn.addr, i64 3
	%load.7 = load i32, i32* %gep.5, align 4			%load.7 = load i32, i32* %gep.5, align 4
	%gep.6 = getelementptr inbounds i32, i32* %inn.addr, i64 1			%gep.6 = getelementptr inbounds i32, i32* %inn.addr, i64 1
	%load.8 = load i32, i32* %gep.6, align 4			%load.8 = load i32, i32* %gep.6, align 4
	%mul.1 = mul i32 %load.3, %load.5			%mul.1 = mul i32 %load.3, %load.5
	%mul.2 = mul i32 %load.2, %load.8			%mul.2 = mul i32 %load.2, %load.8
	%mul.3 = mul i32 %load.4, %load.7			%mul.3 = mul i32 %load.4, %load.7
	%mul.4 = mul i32 %load.1, %load.6			%mul.4 = mul i32 %load.1, %load.6
	%gep.7 = getelementptr inbounds i32, i32* %out, i64 0			%gep.7 = getelementptr inbounds i32, i32* %out, i64 0
				mkuper: What happens if the stores are also out of order? (IIRC, we should already have code to deal with that, I just want to make sure it meshes with the stores being out of order correctly)
				ashahid (Author): I have not checked yet, but I will check.
				ashahid (Author): It gels well with 'stores' being out-of-order by generating proper shufflemask for loads according to the out-of-order stores.
				mkuper: I thought you added a test for the combination of out-of-order loads and out-of-order stores, but turns out I was imagining it. Could you please add one? (We should have a regression test making sure we don't generate extra shuffles.)
	store i32 %mul.1, i32* %gep.7, align 4			store i32 %mul.1, i32* %gep.7, align 4
	%gep.8 = getelementptr inbounds i32, i32* %out, i64 1			%gep.8 = getelementptr inbounds i32, i32* %out, i64 1
	store i32 %mul.2, i32* %gep.8, align 4			store i32 %mul.2, i32* %gep.8, align 4
	%gep.9 = getelementptr inbounds i32, i32* %out, i64 2			%gep.9 = getelementptr inbounds i32, i32* %out, i64 2
	store i32 %mul.3, i32* %gep.9, align 4			store i32 %mul.3, i32* %gep.9, align 4
	%gep.10 = getelementptr inbounds i32, i32* %out, i64 3			%gep.10 = getelementptr inbounds i32, i32* %out, i64 3
	store i32 %mul.4, i32* %gep.10, align 4			store i32 %mul.4, i32* %gep.10, align 4

	ret i32 undef			ret i32 undef
	}			}
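
As a sanity check on the masks in the CHECK lines above, the scalar and vectorized forms of @jumbled-load can be compared in a small standalone model. This is only an illustrative sketch under stated assumptions: the four-element arrays stand in for the memory behind %in and %inn, and `applyMask`, `scalarJumbled`, and `vectorJumbled` are hypothetical names, not anything from the patch or from LLVM.

```cpp
#include <array>
#include <cstddef>

using Vec4 = std::array<int, 4>;

// Models `shufflevector V, undef, Mask` for masks whose indices are all in [0, 4),
// so the undef operand is never selected.
Vec4 applyMask(const Vec4 &V, const Vec4 &Mask) {
  Vec4 R = {{0, 0, 0, 0}};
  for (std::size_t I = 0; I != 4; ++I)
    R[I] = V[static_cast<std::size_t>(Mask[I])];
  return R;
}

// Scalar form of the test body: out[0..3] = mul.1..mul.4.
//   mul.1 = load.3 * load.5 = in[1] * inn[0]
//   mul.2 = load.2 * load.8 = in[3] * inn[1]
//   mul.3 = load.4 * load.7 = in[2] * inn[3]
//   mul.4 = load.1 * load.6 = in[0] * inn[2]
Vec4 scalarJumbled(const Vec4 &In, const Vec4 &Inn) {
  Vec4 Out = {{In[1] * Inn[0], In[3] * Inn[1], In[2] * Inn[3], In[0] * Inn[2]}};
  return Out;
}

// Vectorized form: one wide load per input, one shuffle per load, one wide mul.
Vec4 vectorJumbled(const Vec4 &In, const Vec4 &Inn) {
  const Vec4 LeftMask = {{1, 3, 2, 0}};  // mask checked for [[TMP3]]
  const Vec4 RightMask = {{0, 1, 3, 2}}; // mask checked for [[TMP6]]
  const Vec4 L = applyMask(In, LeftMask);
  const Vec4 R = applyMask(Inn, RightMask);
  Vec4 Out = {{0, 0, 0, 0}};
  for (std::size_t I = 0; I != 4; ++I)
    Out[I] = L[I] * R[I];
  return Out;
}
```

Both functions agree for any input, which is what the two shufflevector masks in the expected output are meant to guarantee.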

test/Transforms/SLPVectorizer/X86/reduction_loads.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-apple-macosx10.10.0 -mattr=+sse4.2 | FileCheck %s			; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-apple-macosx10.10.0 -mattr=+sse4.2 | FileCheck %s


				RKSimon: It'd be better if there was more context to this additional shuffle - regenerate + commit the current output with utils/update_test_checks.py? For an IR loop it's not that large an output.
				mkuper: Yes, I'd be interested to know if we added a shuffle here, or just moved a shuffle from the store side to the load side (which makes sense).
				ashahid (Author): By "current output" do you mean the output generated by utils/update_test_checks.py with this patch? Please explain.
	define i32 @test(i32* nocapture readonly %p) {			define i32 @test(i32* nocapture readonly %p) {
	; CHECK-LABEL: @test(			; CHECK-LABEL: @test(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i32, i32* %p, i64 1			; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i32, i32* %p, i64 1
	; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds i32, i32* %p, i64 2			; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds i32, i32* %p, i64 2
	; CHECK-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds i32, i32* %p, i64 3			; CHECK-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds i32, i32* %p, i64 3
	; CHECK-NEXT: [[ARRAYIDX_4:%.*]] = getelementptr inbounds i32, i32* %p, i64 4			; CHECK-NEXT: [[ARRAYIDX_4:%.*]] = getelementptr inbounds i32, i32* %p, i64 4
	; CHECK-NEXT: [[ARRAYIDX_5:%.*]] = getelementptr inbounds i32, i32* %p, i64 5			; CHECK-NEXT: [[ARRAYIDX_5:%.*]] = getelementptr inbounds i32, i32* %p, i64 5
	; CHECK-NEXT: [[ARRAYIDX_6:%.*]] = getelementptr inbounds i32, i32* %p, i64 6			; CHECK-NEXT: [[ARRAYIDX_6:%.*]] = getelementptr inbounds i32, i32* %p, i64 6
	; CHECK-NEXT: [[ARRAYIDX_7:%.*]] = getelementptr inbounds i32, i32* %p, i64 7			; CHECK-NEXT: [[ARRAYIDX_7:%.*]] = getelementptr inbounds i32, i32* %p, i64 7
	; CHECK-NEXT: br label %for.body			; CHECK-NEXT: br label %for.body
	; CHECK: for.body:			; CHECK: for.body:
	; CHECK-NEXT: [[SUM:%.*]] = phi i32 [ 0, %entry ], [ %add.7, %for.body ]			; CHECK-NEXT: [[SUM:%.*]] = phi i32 [ 0, %entry ], [ %add.7, %for.body ]
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* %p to <8 x i32>*			; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* %p to <8 x i32>*
	; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i32>, <8 x i32>* [[TMP0]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = load <8 x i32>, <8 x i32>* [[TMP0]], align 4
	; CHECK-NEXT: [[TMP2:%.*]] = mul <8 x i32> <i32 42, i32 42, i32 42, i32 42, i32 42, i32 42, i32 42, i32 42>, [[TMP1]]			; CHECK-NEXT: [[TMP2:%.*]] = mul <8 x i32> <i32 42, i32 42, i32 42, i32 42, i32 42, i32 42, i32 42, i32 42>, [[TMP1]]
				RKSimon: What can be done to avoid this regression?
				mkuper: Ohh, right, wanted to ask about this as well. My guess is that this wasn't actually a regression, but we moved the shuffle from the store side to the load side. Is that right?
				RKSimon: If the update_test_checks script has done its job and generated checks for all the IR, then this is an additional shuffle. I can't see an equivalent shuffle or set of extracts in the codegen on the left.
				mkuper: Argh, I didn't even look at the new version of the test; my assumption was from looking at the non-generated one (which is even more embarrassing, since I originally wrote that test and didn't remember that it doesn't have a shuffle...). We really should not be regressing this.
				RKSimon: Any luck with working out what is causing this regression? Cross-lane shuffles can be quite expensive.
				ashahid (Author): The regression occurs because the scalar loads here are initially in reverse-consecutive order. I will update the patch to resolve it.
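To illustrate the point about reverse-consecutive loads, here is a minimal sketch of how a shuffle mask can be derived from the order in which scalar loads touch consecutive memory. It is not the patch's implementation: `jumbled_load_mask` and `needs_shuffle` are hypothetical names, and offsets are assumed to be element indices relative to the base pointer.

```python
def jumbled_load_mask(offsets):
    """Map scalar-load offsets (in use order) to a wide load plus shuffle mask.

    Returns (sorted_offsets, mask) with sorted_offsets[mask[i]] == offsets[i],
    or None if the offsets do not form one consecutive run.
    Hypothetical helper for illustration only.
    """
    sorted_offsets = sorted(offsets)
    # A single vector load needs consecutive, distinct element offsets.
    if any(b - a != 1 for a, b in zip(sorted_offsets, sorted_offsets[1:])):
        return None
    # Lane that each offset occupies after the sorted (consecutive) load.
    position = {off: lane for lane, off in enumerate(sorted_offsets)}
    mask = [position[off] for off in offsets]
    return sorted_offsets, mask


def needs_shuffle(mask):
    """An identity mask means the wide load is already in use order."""
    return mask != list(range(len(mask)))
```

Under these assumptions, loads used in the order `[7, 6, 5, 4, 3, 2, 1, 0]` sort to a single consecutive load but produce a full-reversal mask, which is the kind of extra cross-lane shuffle discussed in this thread; loads already in order produce an identity mask and need no shuffle.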
	; CHECK-NEXT: [[ADD:%.*]] = add i32 undef, [[SUM]]			; CHECK-NEXT: [[ADD:%.*]] = add i32 undef, [[SUM]]
	; CHECK-NEXT: [[ADD_1:%.*]] = add i32 undef, [[ADD]]			; CHECK-NEXT: [[ADD_1:%.*]] = add i32 undef, [[ADD]]
	; CHECK-NEXT: [[ADD_2:%.*]] = add i32 undef, [[ADD_1]]			; CHECK-NEXT: [[ADD_2:%.*]] = add i32 undef, [[ADD_1]]
	; CHECK-NEXT: [[ADD_3:%.*]] = add i32 undef, [[ADD_2]]			; CHECK-NEXT: [[ADD_3:%.*]] = add i32 undef, [[ADD_2]]
	; CHECK-NEXT: [[ADD_4:%.*]] = add i32 undef, [[ADD_3]]			; CHECK-NEXT: [[ADD_4:%.*]] = add i32 undef, [[ADD_3]]
	; CHECK-NEXT: [[ADD_5:%.*]] = add i32 undef, [[ADD_4]]			; CHECK-NEXT: [[ADD_5:%.*]] = add i32 undef, [[ADD_4]]
	; CHECK-NEXT: [[ADD_6:%.*]] = add i32 undef, [[ADD_5]]			; CHECK-NEXT: [[ADD_6:%.*]] = add i32 undef, [[ADD_5]]
	; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <8 x i32> [[TMP2]], <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <8 x i32> [[TMP2]], <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[BIN_RDX:%.*]] = add <8 x i32> [[TMP2]], [[RDX_SHUF]]			; CHECK-NEXT: [[BIN_RDX:%.*]] = add <8 x i32> [[TMP2]], [[RDX_SHUF]]
	; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <8 x i32> [[BIN_RDX]], <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <8 x i32> [[BIN_RDX]], <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[BIN_RDX2:%.*]] = add <8 x i32> [[BIN_RDX]], [[RDX_SHUF1]]			; CHECK-NEXT: [[BIN_RDX2:%.*]] = add <8 x i32> [[BIN_RDX]], [[RDX_SHUF1]]
	; CHECK-NEXT: [[RDX_SHUF3:%.*]] = shufflevector <8 x i32> [[BIN_RDX2]], <8 x i32> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[RDX_SHUF3:%.*]] = shufflevector <8 x i32> [[BIN_RDX2]], <8 x i32> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[BIN_RDX4:%.*]] = add <8 x i32> [[BIN_RDX2]], [[RDX_SHUF3]]			; CHECK-NEXT: [[BIN_RDX4:%.*]] = add <8 x i32> [[BIN_RDX2]], [[RDX_SHUF3]]
	; CHECK-NEXT: [[TMP3:%.*]] = extractelement <8 x i32> [[BIN_RDX4]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = extractelement <8 x i32> [[BIN_RDX4]], i32 0
	; CHECK-NEXT: [[ADD_7:%.*]] = add i32 [[TMP3]], [[SUM]]			; CHECK-NEXT: [[ADD_7:%.*]] = add i32 [[TMP4]], [[SUM]]
				RKSimon: This looks suspicious - why the lonely change from TMP3 to TMP4?
				ashahid (Author): Oh, good catch, I will see.
				ashahid (Author): I was surprised initially, but later realized that this is because the current patch resolves the regression you pointed out. So if you compare this patch, i.e. Diff5, with the previous patch, i.e. Diff4, you will see the expected difference.
	; CHECK-NEXT: br i1 true, label %for.end, label %for.body			; CHECK-NEXT: br i1 true, label %for.end, label %for.body
	; CHECK: for.end:			; CHECK: for.end:
	; CHECK-NEXT: ret i32 [[ADD_7]]			; CHECK-NEXT: ret i32 [[ADD_7]]
	;			;
	entry:			entry:
	%arrayidx.1 = getelementptr inbounds i32, i32* %p, i64 1			%arrayidx.1 = getelementptr inbounds i32, i32* %p, i64 1
	%arrayidx.2 = getelementptr inbounds i32, i32* %p, i64 2			%arrayidx.2 = getelementptr inbounds i32, i32* %p, i64 2
	%arrayidx.3 = getelementptr inbounds i32, i32* %p, i64 3			%arrayidx.3 = getelementptr inbounds i32, i32* %p, i64 3