This is an archive of the discontinued LLVM Phabricator instance.

While we are at it, let's talk about downstream dependencies, if any, on allowing more than one candidate VF along the native path. At the very least we can write down a list of TODOs, so that we know what needs to be improved later.

This may be a bit more than what you anticipated for this patch, but I think this will help everyone.

lib/Transforms/Vectorize/LoopVectorize.cpp
7098 ↗	(On Diff #184775)	This is in some sense reinventing the wheel, and I kind of imagine this can quickly get out of control if different people start trying to add different constraints of their interest. Let's start talking about how we can get to the point of being able to use unified computeMaxVF(). For this patch, however, I'd be happy if we get to the point of being able to use unified getSmallestAndWidestTypes().
7138 ↗	(On Diff #184775)	Let's try not to do this here. Please do this kind of thing inside planInVPlanNativePath().

Thanks for working on this Francesco! +1 to Hideki's point about moving to LoopVectorizationPlanner.

lib/Transforms/Vectorize/LoopVectorize.cpp
1001 ↗	(On Diff #184775)	Is this related to the patch? I suppose guessVectorizationFactor could set UserVF to 1, which could reach this code. It would be better to not vectorize with VF == 1.
3787 ↗	(On Diff #184775)	Related to the patch?
test/Transforms/LoopVectorize/outer_loop_test1_no_explicit_vect_width.ll
90 ↗	(On Diff #184775)	Looks like we are just using a single store width in this test. Maybe it would be worth adding loads/stores of a different type as well?

npanchen added inline comments. Feb 1 2019, 4:08 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
1001 ↗	(On Diff #184775)	For VF = 1 CM is not called, so this change looks unrelated to this commit. If it is related, what is the reason, and why are other places in CM not modified?
7098 ↗	(On Diff #184775)	Agree with Hideki. Ideally, this function should return a range [MinVF, MaxVF], which can be used in buildVPlans() to minimize the number of VPlans built. Please remember that a target can have vector registers of different widths.
7103 ↗	(On Diff #184775)	For cases where the load and store instructions were hoisted out of the loop, this function will return WidestVectorRegBits, which could be >= 128.
7106 ↗	(On Diff #184775)	getMemInstValueType(I)

kmitropo added a subscriber: kmitropo. Feb 3 2019, 5:51 PM

@hsaito , @fhahn , @npanchen ,

Thank you for looking into this.

I have responded to all comments about the design. I will work on a new patch and submit it according to your suggestions. For now, I have ignored the comments on trivial changes (for example, @fhahn's comment on adding more test coverage). I will address those along the way as I re-implement the patch.

Thank you,

Francesco

lib/Transforms/Vectorize/LoopVectorize.cpp
1001 ↗	(On Diff #184775)	Yes, this is related to this patch. By guessing the number of lanes with the algorithm I proposed, all the target-independent invocations of `opt` that use VPLAN end up reporting vector registers of size 32-bit (`TTI->getRegisterBitWidth(true /* Vector*/)`), which means the vectorization performed in some of the tests is done with a vectorization factor of 1. I am not saying this is the right thing to do; I was actually hacking something that worked to initiate the discussion.
1001 ↗	(On Diff #184775)	This assertion was firing when testing `test/Transforms/LoopVectorize/explicit_outer_detection.ll`. Because the test is run with no explicit vector width (this patch removes such need) and without specifying a target for `opt`, the vectorizer was generating a vector loop with one lane, for the same reason explained in my previous comment (no target implies `TTI->getRegisterBitWidth(true /* Vector*/)` returning 32).
3787 ↗	(On Diff #184775)	Yes. Again, this was needed because of an assertion firing when building a binary operator in the process of vectorizing `case2` in `test/Transforms/LoopVectorize/explicit_outer_detection.ll`. The loop in `case2` gets vectorized with this patch, with a vectorization factor of 1, but the code generation in the inner loop vectorizer is trying to build a binary operator between a scalar phi and a vector consisting of one lane, which is of course not possible. This change makes sure that the phi is generated not as a scalar but as a one-lane vector, to prevent the failure when building the binary operator.
7098 ↗	(On Diff #184775)	Responding to both comments from @hsaito and @npanchen here:
> This is in some sense reinventing the wheel
I fully agree :). I needed a starting point to be able to discuss this with you.
> Let's start talking about how we can get to the point of being able to use unified computeMaxVF(). For this patch, however, I'd be happy if we get to the point of being able to use unified getSmallestAndWidestTypes().
OK, I will look into getting access to getSmallestAndWidestTypes here.
> Please, remember that target can have different vector registers.
Yes, I understand that using the widest ones is not ideal, as you miss for example 2-lane vectorization on floating point data on machines that support both 64-bit and 128-bit vector registers (for example `aarch64`). Just to make sure we are on the same page here. This could be solved by considering all possible power of two vector widths up to the maximum one. For example, if the loop is processing 32-bit data, and the target has 128, 256, and 512-bit wide registers, we should ask VPlan to consider 4, 8 and 16 lanes vectorizations. Is my interpretation correct here?
7103 ↗	(On Diff #184775)	Sorry, I don't understand this comment. Could you please explain it with an example? Also, given that my understanding is that you want me to use `getSmallestAndWidestTypes`, is the comment still valid? I think that by using `getSmallestAndWidestTypes` I will essentially remove my custom code.
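
The lane arithmetic in the reply to 7098 above can be sketched in a few lines. This is a minimal standalone illustration, not code from the patch: `candidateVFs` and its parameters are hypothetical names, and it assumes the register widths and the element width are both given in bits.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of LLVM): for every vector register width the
// target supports, the lane count is RegBits / ElementBits.
std::vector<unsigned> candidateVFs(const std::vector<unsigned> &RegBitWidths,
                                   unsigned ElementBits) {
  assert(ElementBits > 0 && "element width must be non-zero");
  std::vector<unsigned> VFs;
  for (unsigned RegBits : RegBitWidths) {
    unsigned Lanes = RegBits / ElementBits;
    // A single lane means "no vectorization" in this thread, so skip it.
    if (Lanes >= 2)
      VFs.push_back(Lanes);
  }
  return VFs;
}
```

With 32-bit data and registers of 128, 256 and 512 bits, `candidateVFs({128, 256, 512}, 32)` yields 4, 8 and 16 lanes, matching the example in the comment. With 64-bit and 128-bit registers it yields 2 and 4, which covers the 2-lane case mentioned for `aarch64`.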

Thanks Francesco for helping us remove some of the constraints we have in VPlan native path!
I agree with the idea of using getSmallestAndWidestTypes() as a starting point while we don't have a proper cost model.

One more comment below.
Thanks,
Diego

dcaballe added inline comments. Feb 4 2019, 9:31 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
7141 ↗	(On Diff #184775)	Regarding the issue with VF = 1, we are using VF = 1 to denote no vectorization and I think we should preserve that behavior. This basically means we shouldn't try to generate vector code (shouldn't invoke `getWideningDecision` et al.) with VF = 1. I think the problem happens because we are setting `UserVF` here, instead of `VF`, and probably `1` is an unexpected value for `UserVF` (just guessing, I haven't checked it out). Maybe we should leave `UserVF` to values actually coming from the user and set `VF` instead? If you move this code to `planInVPlanNativePath`, as suggested, I think the VF = 1 problem would be fixed.

npanchen added inline comments. Feb 4 2019, 6:05 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
7103 ↗	(On Diff #184775)	The example I was thinking about is pretty simple:
    #pragma clang loop vectorize(enable)
    for (i = 0; i < n; ++i) {
      red += i;
    }
After some optimization the loop body can look like:
    loop_body:
      %1 = phi i64 %i_init, %i
      %2 = phi i32 %red_init, %red
      %red = %2 + %1
      %i = %1 + 1
      cmp %i, %n
In this case guessVPlanVF() will not find any LD/ST instruction, thus Max will be equal to 1 and WidestVectorRegBits will be returned. For example, in case of SKX WidestVectorRegBits = 512.

Hi all,

I finally addressed your comments and updated the patch. Let me know what you think.

Francesco

lib/Transforms/Vectorize/LoopVectorize.cpp
1001 ↗	(On Diff #184775)	I restored this assertion.
3787 ↗	(On Diff #184775)	I restored this code.
7098 ↗	(On Diff #184775)	I have a version that does [MinVF, MaxVF] computation, but this triggers the assertion in this method:
    void LoopVectorizationPlanner::setBestPlan(unsigned VF, unsigned UF) {
      LLVM_DEBUG(dbgs() << "Setting best plan to VF=" << VF << ", UF=" << UF << '\n');
      BestVF = VF;
      BestUF = UF;
      erase_if(VPlans, [VF](const VPlanPtr &Plan) { return !Plan->hasVF(VF); });
      assert(VPlans.size() == 1 && "Best VF has not a single VPlan.");
    }
Am I correct in thinking that to be able to choose among different plans, we need to provide VPLAN with a cost model that is able to evaluate all the options? If that's the case, I think that generating Max and Min VF goes beyond the scope of this patch.
7103 ↗	(On Diff #184775)	The new version doesn't use my custom code but `getSmallestAndWidestTypes`. I could add a test that does the hoisting on outer loops, but I am not sure how to construct this example. Can you provide a C example to start with? Or are you happy for me to skip this check?
7106 ↗	(On Diff #184775)	This code is not used anymore in the last version.
7141 ↗	(On Diff #184775)	The VF = 1 case is not generated anymore after the last change set, marking this comment as done.
test/Transforms/LoopVectorize/outer_loop_test1_no_explicit_vect_width.ll
90 ↗	(On Diff #184775)	The patch no longer relies on the load/store types, but uses `getSmallestAndWidestTypes`. Do you still want me to add such a test, or can we trust `getSmallestAndWidestTypes` to do the right job here?

I mostly changed the code to use the infrastructure that LLVM already provides to determine the vectorization factor.

In D57598#1428062, @fpetrogalli wrote:

I mostly changed the code to use the infrastructure that LLVM already provides to determine the vectorization factor.

Haven't gone through the LIT test yet, but the code addressed my concerns.

lib/Transforms/Vectorize/LoopVectorize.cpp
6108 ↗	(On Diff #190488)	If the return value is always 2 or greater, we should replace lines 6099-6103 with assert for VF >= 2. Else, we should move lines 6099-6103 below line 6108 and adjust the code and the comment accordingly.
6112 ↗	(On Diff #190488)	This is no longer user VF. Should be something along the lines of "LV: Using " << UserVF ? "user VF " : "computed VF " << VF

Addressed second round of comments from @hsaito.

Herald added a subscriber: jdoerfert. Mar 13 2019, 8:29 PM

fpetrogalli marked an inline comment as done. Mar 13 2019, 8:30 PM

hsaito added inline comments. Mar 14 2019, 9:51 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
6105 ↗	(On Diff #190562)	(VPlanBuildStressTest && VF < 2) ?

fpetrogalli marked an inline comment as done. Mar 14 2019, 9:56 AM

fpetrogalli added inline comments.

lib/Transforms/Vectorize/LoopVectorize.cpp
6105 ↗	(On Diff #190562)	Sure, but why? If we are in `VPlanBuildStressTest`, why would you care about the value of VF? Isn't it better to have full control on the stress tests and make sure that we always vectorize with VF = 4?

hsaito added inline comments. Mar 14 2019, 10:25 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
6105 ↗	(On Diff #190562)	I then suggest going back to (VPlanBuildStressTest && !UserVF) --- override only if the programmer doesn't explicitly set it. In the spirit of stress testing, I think it makes sense to have the ability to choose any legal VF (and in the future extend it to "scalable"). Maybe we should take a compiler option instead of the hard-coded 4. So, something along the lines of the following?
    if (!UserVF)
      if (VPlanBuildStressTest)
        VF = 4
      else
        VF = determineVplanVF()
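
The selection order suggested in the comment above can be written as a small standalone function. This is only a sketch: `chooseVF` is a hypothetical name, `StressTestVF` models the suggested (not existing) compiler option, and `ComputedVF` stands in for whatever the planner would compute.

```cpp
#include <cassert>

// Hypothetical stand-in for the selection logic discussed in the review.
// A non-zero UserVF always wins; otherwise stress testing picks its own VF,
// and only then is the computed value used.
unsigned chooseVF(unsigned UserVF, bool StressTest, unsigned StressTestVF,
                  unsigned ComputedVF) {
  if (UserVF)
    return UserVF;
  unsigned VF = StressTest ? StressTestVF : ComputedVF;
  // Same invariant the patch asserts: VF must be a non-zero power of two.
  assert(VF != 0 && (VF & (VF - 1)) == 0 && "VF needs to be a power of two");
  return VF;
}
```

For example, `chooseVF(8, true, 4, 2)` keeps the user's 8, while `chooseVF(0, true, 4, 2)` falls back to the stress-test value 4.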

fpetrogalli updated this revision to Diff 190695. Mar 14 2019, 11:50 AM

fpetrogalli marked 3 inline comments as done.

fpetrogalli added inline comments.

lib/Transforms/Vectorize/LoopVectorize.cpp
6105 ↗	(On Diff #190562)	Thank you for explaining. I have re-worked the if statement as requested.

LGTM. Please wait for a few days to give others a chance to go over your updated patch.

This revision is now accepted and ready to land. Mar 14 2019, 12:19 PM

Meinersbur added a subscriber: Meinersbur. Mar 14 2019, 12:28 PM

Meinersbur added inline comments.

lib/Transforms/Vectorize/LoopVectorizationPlanner.h
233–234 ↗	(On Diff #190562)	[nit] It is somewhat unusual to pass by-value with const.
test/Transforms/LoopVectorize/outer_loop_test1_no_explicit_vect_width.ll
16–17 ↗	(On Diff #190562)	The RUN lines are usually at the top of the file

fpetrogalli marked an inline comment as done. Mar 14 2019, 12:41 PM

fpetrogalli added inline comments.

lib/Transforms/Vectorize/LoopVectorizationPlanner.h
233–234 ↗	(On Diff #190562)	I see, but while developing this I ran into problems caused by the fact that I could modify UserVF inside the function. Would a const ref be better? I can try that; if it doesn't work and you don't like this const value, I will revert the interface to the original one.

Meinersbur added inline comments. Mar 14 2019, 1:38 PM

lib/Transforms/Vectorize/LoopVectorizationPlanner.h
233–234 ↗	(On Diff #190562)	It should be a coding standard question. A by-value const may help against accidental assignment in the implementation. Unfortunately, this function implementation detail leaks into the function signature. Since I don't see this anywhere else in the LLVM code base, I'd prefer not to do it. Otherwise, if we want to apply this consistently, we'd have to add const to by-value parameters in many functions (e.g. `setBestPlan` below). Const ref would be worse.
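
As background to this exchange: in C++, a top-level `const` on a by-value parameter is not part of the function type, so it can appear on the definition only. The sketch below is a standalone illustration with a hypothetical function `pickVF`; it is not what the patch ended up doing (the `const` was removed and a local `VF` was introduced instead).

```cpp
#include <cassert>
#include <type_traits>

// Declaration as callers would see it in a header: no const on the parameter.
unsigned pickVF(unsigned UserVF);

// Definition: the top-level const only prevents the body from assigning to
// UserVF by accident. A separate local holds the value that may change.
unsigned pickVF(const unsigned UserVF) {
  const unsigned VF = UserVF ? UserVF : 4; // 4 is an arbitrary fallback here.
  assert((VF & (VF - 1)) == 0 && "VF needs to be a power of two");
  return VF;
}

// Both spellings name the same function type.
static_assert(std::is_same<unsigned(unsigned), unsigned(const unsigned)>::value,
              "top-level const on a by-value parameter is not part of the type");
```

Because the declaration and the definition refer to the same function, the `const` need not leak into the interface; whether to use it at all remains the coding-standard question raised in the comment.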

fpetrogalli updated this revision to Diff 190716. Mar 14 2019, 1:45 PM

fpetrogalli marked 4 inline comments as done.

fpetrogalli added inline comments.

lib/Transforms/Vectorize/LoopVectorizationPlanner.h
233–234 ↗	(On Diff #190562)	Thank you for explaining. I have removed the `const` and restored the original interface.

fpetrogalli marked an inline comment as done and 2 inline comments as not done. Mar 14 2019, 1:48 PM

Thanks, Francesco. LGTM!

@npanchen , @fhahn , gentle ping :)

Francesco

Herald added a subscriber: psnobl. Mar 22 2019, 3:32 PM

In D57598#1440282, @fpetrogalli wrote:

@npanchen , @fhahn , gentle ping :)

Francesco

I think you waited long enough already. I suggest proceeding to commit. Any further comments can be addressed post-commit.

In D57598#1440284, @hsaito wrote:

I think you waited long enough already. I suggest proceeding to commit. Any further comments can be addressed post-commit.

OK.

@hsaito, I don't have commit access, could you commit this change for me?

Thank you!

@hsaito, I don't have commit access, could you commit this change for me?

Will do. I had a day off today. Will take care of it, hopefully tomorrow.

fhahn added inline comments. Mar 27 2019, 3:11 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
6111 ↗	(On Diff #190716)	Is the lambda necessary? You can just have `<< (UserVF ? "user VF" : "computed VF") <<` inline, I think
6114 ↗	(On Diff #190716)	I think it would be clearer to say something along the lines of "LV: Using .. VF to build VPlans", to make it a bit clearer that VF is not necessarily what will be used for vectorization (e.g. VF == 1 means no vectorization).
7157 ↗	(On Diff #190716)	There is VectorizationFactor::Disabled(), which is used in other places. Could you use this here instead of VF.Width == 1?
test/Transforms/LoopVectorize/explicit_outer_detection.ll
71 ↗	(On Diff #190716)	Comment needs updating, we now analyze loops without user VF too.
test/Transforms/LoopVectorize/outer_loop_test1_no_explicit_vect_width.ll
1 ↗	(On Diff #190716)	This test will fail on builds that do not include the X86/AArch64 targets. The X86 version should go in test/Transforms/LoopVectorize/X86/ and the AArch64 one in test/Transforms/LoopVectorize/AArch64/

fhahn added inline comments. Mar 27 2019, 3:14 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
6103 ↗	(On Diff #190716)	I think as a follow up, we can drop VPlanBuildStressTest, now that we do not require a UserVF to build VPlans.

I have addressed the last round of comments from @fhahn.

Herald added a subscriber: javed.absar. Mar 27 2019, 12:31 PM

fpetrogalli added inline comments. Mar 27 2019, 12:31 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6103 ↗	(On Diff #190716)	Should I address this in a separate patch or this one?
7157 ↗	(On Diff #190716)	`struct VectorizationFactor` did not have a "==" operator. I have added it, and adapted this condition.
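
To illustrate what the added "==" operator compares, here is a standalone copy of the struct shape shown in this revision's diff, plus a hypothetical helper `isDisabled` that is not part of the patch.

```cpp
#include <cassert>

// Standalone copy of the struct as it appears in the diff of this revision.
struct VectorizationFactor {
  unsigned Width; // Vector width with best cost
  unsigned Cost;  // Cost of the loop with that width

  // Width 1 means no vectorization, cost 0 means uncomputed cost.
  static VectorizationFactor Disabled() { return {1, 0}; }

  bool operator==(const VectorizationFactor &rhs) const {
    return Width == rhs.Width && Cost == rhs.Cost;
  }
};

// Hypothetical helper, for illustration only.
bool isDisabled(const VectorizationFactor &VF) {
  return VectorizationFactor::Disabled() == VF;
}
```

Note that the comparison checks both fields, so a factor with `Width == 1` but a non-zero `Cost` does not compare equal to `Disabled()`. That differs from a plain `Width == 1` test and may be worth keeping in mind when using this idiom.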

I forgot to update the comment in the test as requested by @fhahn. Now it is done.

fpetrogalli marked an inline comment as done. Mar 27 2019, 12:35 PM

Thanks Francesco! I'll commit the change tomorrow, unless @hsaito does it today :)

lib/Transforms/Vectorize/LoopVectorize.cpp
6103 ↗	(On Diff #190716)	Yep that would be great!

In D57598#1445180, @fhahn wrote:

Thanks Francesco! I'll commit the change tomorrow, unless @hsaito does it today :)

Thank you @fhahn !

fhahn added inline comments. Mar 27 2019, 2:34 PM

lib/Transforms/Vectorize/LoopVectorizationPlanner.h
180 ↗	(On Diff #192498)	No braces needed, I'll run clang-format on your patch before committing.

In D57598#1445180, @fhahn wrote:

Thanks Francesco! I'll commit the change tomorrow, unless @hsaito does it today :)

@fhahn, I'm trying to figure out the appropriate proxy setting for external git.
You may be quicker. First commit attempt after SVN is gone. Some learning curve here.

Closed by commit rL357156: [VPlan] Determine Vector Width programmatically. (authored by fhahn). Mar 28 2019, 3:35 AM

This revision was automatically updated to reflect the committed changes.

Thanks Francesco!

In D57598#1445741, @fhahn wrote:

Thanks Francesco!

Thanks, Francesco and Florian.

dcaballe mentioned this in D59952: [VPLAN] Minor improvement to testing and debug messages. Mar 28 2019, 6:47 PM

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationPlanner.h	4 lines
llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp	53 lines
llvm/trunk/test/Transforms/LoopVectorize/AArch64/outer_loop_test1_no_explicit_vect_width.ll	83 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/outer_loop_test1_no_explicit_vect_width.ll	114 lines
llvm/trunk/test/Transforms/LoopVectorize/explicit_outer_detection.ll	12 lines

Diff 192599

llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationPlanner.h

@@ ... @@
 struct VectorizationFactor {
   // Vector width with best cost
   unsigned Width;
   // Cost of the loop with that width
   unsigned Cost;

   // Width 1 means no vectorization, cost 0 means uncomputed cost.
   static VectorizationFactor Disabled() { return {1, 0}; }
+
+  bool operator==(const VectorizationFactor &rhs) const {
+    return Width == rhs.Width && Cost == rhs.Cost;
+  }
 };

 /// Planner drives the vectorization process after having passed
 /// Legality checks.
 class LoopVectorizationPlanner {
   /// The loop that we evaluate.
   Loop *OrigLoop;

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ ... @@ static bool isExplicitVecOuterLoop(Loop *OuterLp,

   Function *Fn = OuterLp->getHeader()->getParent();
   if (!Hints.allowVectorization(Fn, OuterLp,
                                 true /*VectorizeOnlyWhenForced*/)) {
     LLVM_DEBUG(dbgs() << "LV: Loop hints prevent outer loop vectorization.\n");
     return false;
   }

-  if (!Hints.getWidth()) {
-    LLVM_DEBUG(dbgs() << "LV: Not vectorizing: No user vector width.\n");
-    Hints.emitRemarkWithHints();
-    return false;
-  }
-
   if (Hints.getInterleave() > 1) {
     // TODO: Interleave support is future work.
     LLVM_DEBUG(dbgs() << "LV: Not vectorizing: Interleave is not supported for "
                          "outer loops.\n");
     Hints.emitRemarkWithHints();
     return false;
   }

@@ ... @@ void LoopVectorizationCostModel::collectValuesToIgnore() {
   // detection.
   for (auto &Induction : *Legal->getInductionVars()) {
     InductionDescriptor &IndDes = Induction.second;
     const SmallVectorImpl<Instruction *> &Casts = IndDes.getCastInsts();
     VecValuesToIgnore.insert(Casts.begin(), Casts.end());
   }
 }

+// TODO: we could return a pair of values that specify the max VF and
+// min VF, to be used in `buildVPlans(MinVF, MaxVF)` instead of
+// `buildVPlans(VF, VF)`. We cannot do it because VPLAN at the moment
+// doesn't have a cost model that can choose which plan to execute if
+// more than one is generated.
+unsigned determineVPlanVF(const unsigned WidestVectorRegBits,
+                          LoopVectorizationCostModel &CM) {
+  unsigned WidestType;
+  std::tie(std::ignore, WidestType) = CM.getSmallestAndWidestTypes();
+  return WidestVectorRegBits / WidestType;
+}
+
 VectorizationFactor
 LoopVectorizationPlanner::planInVPlanNativePath(bool OptForSize,
                                                 unsigned UserVF) {
+  unsigned VF = UserVF;
   // Outer loop handling: They may require CFG and instruction level
   // transformations before even evaluating whether vectorization is profitable.
   // Since we cannot modify the incoming IR, we need to build VPlan upfront in
   // the vectorization pipeline.
   if (!OrigLoop->empty()) {
-    // TODO: If UserVF is not provided, we set UserVF to 4 for stress testing.
-    // This won't be necessary when UserVF is not required in the VPlan-native
-    // path.
-    if (VPlanBuildStressTest && !UserVF)
-      UserVF = 4;
+    // If the user doesn't provide a vectorization factor, determine a
+    // reasonable one.
+    if (!UserVF) {
+      // We set VF to 4 for stress testing.
+      if (VPlanBuildStressTest)
+        VF = 4;
+      else
+        VF = determineVPlanVF(TTI->getRegisterBitWidth(true /* Vector*/), CM);
+    }

     assert(EnableVPlanNativePath && "VPlan-native path is not enabled.");
-    assert(UserVF && "Expected UserVF for outer loop vectorization.");
-    assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");
-    LLVM_DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");
-    buildVPlans(UserVF, UserVF);
+    assert(isPowerOf2_32(VF) && "VF needs to be a power of two");
+    LLVM_DEBUG(dbgs() << "LV: Using " << (UserVF ? "user VF " : "computed VF ")
+                      << VF << " to build VPlans.\n");
+    buildVPlans(VF, VF);

     // For VPlan build stress testing, we bail out after VPlan construction.
     if (VPlanBuildStressTest)
       return VectorizationFactor::Disabled();

-    return {UserVF, 0};
+    return {VF, 0};
   }

   LLVM_DEBUG(
       dbgs() << "LV: Not vectorizing. Inner loops aren't supported in the "
                 "VPlan-native path.\n");
   return VectorizationFactor::Disabled();
 }

@@ ... @@ static bool processLoopInVPlanNativePath(
   LoopVectorizationCostModel CM(L, PSE, LI, LVL, *TTI, TLI, DB, AC, ORE, F,
                                 &Hints, IAI);
   // Use the planner for outer loop vectorization.
   // TODO: CM is not used at this point inside the planner. Turn CM into an
   // optional argument if we don't need it in the future.
   LoopVectorizationPlanner LVP(L, LI, TLI, TTI, LVL, CM);

   // Get user vectorization factor.
-  unsigned UserVF = Hints.getWidth();
+  const unsigned UserVF = Hints.getWidth();

   // Check the function attributes to find out if this function should be
   // optimized for size.
   bool OptForSize =
       Hints.getForce() != LoopVectorizeHints::FK_Enabled && F->optForSize();

   // Plan how to best vectorize, return the best VF and its cost.
-  VectorizationFactor VF = LVP.planInVPlanNativePath(OptForSize, UserVF);
+  const VectorizationFactor VF = LVP.planInVPlanNativePath(OptForSize, UserVF);

   // If we are stress testing VPlan builds, do not attempt to generate vector
   // code. Masked vector code generation support will follow soon.
-  if (VPlanBuildStressTest || EnableVPlanPredication)
+  // Also, do not attempt to vectorize if no vector code will be produced.
+  if (VPlanBuildStressTest || EnableVPlanPredication ||
+      VectorizationFactor::Disabled() == VF)
     return false;

   LVP.setBestPlan(VF.Width, 1);

-  InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, UserVF, 1, LVL,
-                         &CM);
+  InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, 1, LVL,
+                         &CM);
   LLVM_DEBUG(dbgs() << "Vectorizing outer loop in \""
                     << L->getHeader()->getParent()->getName() << "\"\n");
   LVP.executePlan(LB, DT);

   // Mark the loop as already vectorized to avoid vectorizing again.
   Hints.setAlreadyVectorized();

llvm/trunk/test/Transforms/LoopVectorize/AArch64/outer_loop_test1_no_explicit_vect_width.ll

				; RUN: opt -S -loop-vectorize -enable-vplan-native-path -mtriple aarch64-gnu-linux < %s | FileCheck %s

				; extern int arr[8][8];
				; extern int arr2[8];
				;
				; void foo(int n)
				; {
				; int i1, i2;
				;
				; #pragma clang loop vectorize(enable)
				; for (i1 = 0; i1 < 8; i1++) {
				; arr2[i1] = i1;
				; for (i2 = 0; i2 < 8; i2++)
				; arr[i2][i1] = i1 + n;
				; }
				; }
				;

				; CHECK-LABEL: vector.ph:
				; CHECK: %[[SplatVal:.*]] = insertelement <4 x i32> undef, i32 %n, i32 0
				; CHECK: %[[Splat:.*]] = shufflevector <4 x i32> %[[SplatVal]], <4 x i32> undef, <4 x i32> zeroinitializer

				; CHECK-LABEL: vector.body:
				; CHECK: %[[Ind:.*]] = phi i64 [ 0, %vector.ph ], [ %[[IndNext:.*]], %[[ForInc:.*]] ]
				; CHECK: %[[VecInd:.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %vector.ph ], [ %[[VecIndNext:.*]], %[[ForInc]] ]
				; CHECK: %[[AAddr:.*]] = getelementptr inbounds [8 x i32], [8 x i32]* @arr2, i64 0, <4 x i64> %[[VecInd]]
				; CHECK: %[[VecIndTr:.*]] = trunc <4 x i64> %[[VecInd]] to <4 x i32>
				; CHECK: call void @llvm.masked.scatter.v4i32.v4p0i32(<4 x i32> %[[VecIndTr]], <4 x i32*> %[[AAddr]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>)
				; CHECK: %[[VecIndTr2:.*]] = trunc <4 x i64> %[[VecInd]] to <4 x i32>
				; CHECK: %[[StoreVal:.*]] = add nsw <4 x i32> %[[VecIndTr2]], %[[Splat]]
				; CHECK: br label %[[InnerLoop:.+]]

				; CHECK: [[InnerLoop]]:
				; CHECK: %[[InnerPhi:.*]] = phi <4 x i64> [ %[[InnerPhiNext:.*]], %[[InnerLoop]] ], [ zeroinitializer, %vector.body ]
				; CHECK: %[[AAddr2:.*]] = getelementptr inbounds [8 x [8 x i32]], [8 x [8 x i32]]* @arr, i64 0, <4 x i64> %[[InnerPhi]], <4 x i64> %[[VecInd]]
				; CHECK: call void @llvm.masked.scatter.v4i32.v4p0i32(<4 x i32> %[[StoreVal]], <4 x i32*> %[[AAddr2]], i32 4, <4 x i1> <i1 true, i1 true, i1 true
				; CHECK: %[[InnerPhiNext]] = add nuw nsw <4 x i64> %[[InnerPhi]], <i64 1, i64 1, i64 1, i64 1>
				; CHECK: %[[VecCond:.*]] = icmp eq <4 x i64> %[[InnerPhiNext]], <i64 8, i64 8, i64 8, i64 8>
				; CHECK: %[[InnerCond:.*]] = extractelement <4 x i1> %[[VecCond]], i32 0
				; CHECK: br i1 %[[InnerCond]], label %[[ForInc]], label %[[InnerLoop]]

				; CHECK: [[ForInc]]:
				; CHECK: %[[IndNext]] = add i64 %[[Ind]], 4
				; CHECK: %[[VecIndNext]] = add <4 x i64> %[[VecInd]], <i64 4, i64 4, i64 4, i64 4>
				; CHECK: %[[Cmp:.*]] = icmp eq i64 %[[IndNext]], 8
				; CHECK: br i1 %[[Cmp]], label %middle.block, label %vector.body

				@arr2 = external global [8 x i32], align 16
				@arr = external global [8 x [8 x i32]], align 16

				; Function Attrs: norecurse nounwind uwtable
				define void @foo(i32 %n) {
				entry:
				br label %for.body

				for.body: ; preds = %for.inc8, %entry
				%indvars.iv21 = phi i64 [ 0, %entry ], [ %indvars.iv.next22, %for.inc8 ]
				%arrayidx = getelementptr inbounds [8 x i32], [8 x i32]* @arr2, i64 0, i64 %indvars.iv21
				%0 = trunc i64 %indvars.iv21 to i32
				store i32 %0, i32* %arrayidx, align 4
				%1 = trunc i64 %indvars.iv21 to i32
				%add = add nsw i32 %1, %n
				br label %for.body3

				for.body3: ; preds = %for.body3, %for.body
				%indvars.iv = phi i64 [ 0, %for.body ], [ %indvars.iv.next, %for.body3 ]
				%arrayidx7 = getelementptr inbounds [8 x [8 x i32]], [8 x [8 x i32]]* @arr, i64 0, i64 %indvars.iv, i64 %indvars.iv21
				store i32 %add, i32* %arrayidx7, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 8
				br i1 %exitcond, label %for.inc8, label %for.body3

				for.inc8: ; preds = %for.body3
				%indvars.iv.next22 = add nuw nsw i64 %indvars.iv21, 1
				%exitcond23 = icmp eq i64 %indvars.iv.next22, 8
				br i1 %exitcond23, label %for.end10, label %for.body, !llvm.loop !1

				for.end10: ; preds = %for.inc8
				ret void
				}

				!1 = distinct !{!1, !2}
				!2 = !{!"llvm.loop.vectorize.enable", i1 true}

llvm/trunk/test/Transforms/LoopVectorize/X86/outer_loop_test1_no_explicit_vect_width.ll

				; RUN: opt -S -loop-vectorize -enable-vplan-native-path -mtriple x86_64 < %s | FileCheck %s
				; RUN: opt -S -loop-vectorize -enable-vplan-native-path -mtriple x86_64 -mattr=+avx < %s | FileCheck %s --check-prefix=AVX
				; RUN: opt -S -loop-vectorize -enable-vplan-native-path -mtriple x86_64 -mattr=+avx2 < %s | FileCheck %s --check-prefix=AVX

				; extern int arr[8][8];
				; extern int arr2[8];
				;
				; void foo(int n)
				; {
				; int i1, i2;
				;
				; #pragma clang loop vectorize(enable)
				; for (i1 = 0; i1 < 8; i1++) {
				; arr2[i1] = i1;
				; for (i2 = 0; i2 < 8; i2++)
				; arr[i2][i1] = i1 + n;
				; }
				; }
				;

				; CHECK-LABEL: vector.ph:
				; CHECK: %[[SplatVal:.*]] = insertelement <4 x i32> undef, i32 %n, i32 0
				; CHECK: %[[Splat:.*]] = shufflevector <4 x i32> %[[SplatVal]], <4 x i32> undef, <4 x i32> zeroinitializer

				; CHECK-LABEL: vector.body:
				; CHECK: %[[Ind:.*]] = phi i64 [ 0, %vector.ph ], [ %[[IndNext:.*]], %[[ForInc:.*]] ]
				; CHECK: %[[VecInd:.*]] = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %vector.ph ], [ %[[VecIndNext:.*]], %[[ForInc]] ]
				; CHECK: %[[AAddr:.*]] = getelementptr inbounds [8 x i32], [8 x i32]* @arr2, i64 0, <4 x i64> %[[VecInd]]
				; CHECK: %[[VecIndTr:.*]] = trunc <4 x i64> %[[VecInd]] to <4 x i32>
				; CHECK: call void @llvm.masked.scatter.v4i32.v4p0i32(<4 x i32> %[[VecIndTr]], <4 x i32*> %[[AAddr]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>)
				; CHECK: %[[VecIndTr2:.*]] = trunc <4 x i64> %[[VecInd]] to <4 x i32>
				; CHECK: %[[StoreVal:.*]] = add nsw <4 x i32> %[[VecIndTr2]], %[[Splat]]
				; CHECK: br label %[[InnerLoop:.+]]

				; CHECK: [[InnerLoop]]:
				; CHECK: %[[InnerPhi:.*]] = phi <4 x i64> [ %[[InnerPhiNext:.*]], %[[InnerLoop]] ], [ zeroinitializer, %vector.body ]
				; CHECK: %[[AAddr2:.*]] = getelementptr inbounds [8 x [8 x i32]], [8 x [8 x i32]]* @arr, i64 0, <4 x i64> %[[InnerPhi]], <4 x i64> %[[VecInd]]
				; CHECK: call void @llvm.masked.scatter.v4i32.v4p0i32(<4 x i32> %[[StoreVal]], <4 x i32*> %[[AAddr2]], i32 4, <4 x i1> <i1 true, i1 true, i1 true
				; CHECK: %[[InnerPhiNext]] = add nuw nsw <4 x i64> %[[InnerPhi]], <i64 1, i64 1, i64 1, i64 1>
				; CHECK: %[[VecCond:.*]] = icmp eq <4 x i64> %[[InnerPhiNext]], <i64 8, i64 8, i64 8, i64 8>
				; CHECK: %[[InnerCond:.*]] = extractelement <4 x i1> %[[VecCond]], i32 0
				; CHECK: br i1 %[[InnerCond]], label %[[ForInc]], label %[[InnerLoop]]

				; CHECK: [[ForInc]]:
				; CHECK: %[[IndNext]] = add i64 %[[Ind]], 4
				; CHECK: %[[VecIndNext]] = add <4 x i64> %[[VecInd]], <i64 4, i64 4, i64 4, i64 4>
				; CHECK: %[[Cmp:.*]] = icmp eq i64 %[[IndNext]], 8
				; CHECK: br i1 %[[Cmp]], label %middle.block, label %vector.body

				; AVX-LABEL: vector.ph:
				; AVX: %[[SplatVal:.*]] = insertelement <8 x i32> undef, i32 %n, i32 0
				; AVX: %[[Splat:.*]] = shufflevector <8 x i32> %[[SplatVal]], <8 x i32> undef, <8 x i32> zeroinitializer

				; AVX-LABEL: vector.body:
				; AVX: %[[Ind:.*]] = phi i64 [ 0, %vector.ph ], [ %[[IndNext:.*]], %[[ForInc:.*]] ]
				; AVX: %[[VecInd:.*]] = phi <8 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, %vector.ph ], [ %[[VecIndNext:.*]], %[[ForInc]] ]
				; AVX: %[[AAddr:.*]] = getelementptr inbounds [8 x i32], [8 x i32]* @arr2, i64 0, <8 x i64> %[[VecInd]]
				; AVX: %[[VecIndTr:.*]] = trunc <8 x i64> %[[VecInd]] to <8 x i32>
				; AVX: call void @llvm.masked.scatter.v8i32.v8p0i32(<8 x i32> %[[VecIndTr]], <8 x i32*> %[[AAddr]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>)
				; AVX: %[[VecIndTr2:.*]] = trunc <8 x i64> %[[VecInd]] to <8 x i32>
				; AVX: %[[StoreVal:.*]] = add nsw <8 x i32> %[[VecIndTr2]], %[[Splat]]
				; AVX: br label %[[InnerLoop:.+]]

				; AVX: [[InnerLoop]]:
				; AVX: %[[InnerPhi:.*]] = phi <8 x i64> [ %[[InnerPhiNext:.*]], %[[InnerLoop]] ], [ zeroinitializer, %vector.body ]
				; AVX: %[[AAddr2:.*]] = getelementptr inbounds [8 x [8 x i32]], [8 x [8 x i32]]* @arr, i64 0, <8 x i64> %[[InnerPhi]], <8 x i64> %[[VecInd]]
				; AVX: call void @llvm.masked.scatter.v8i32.v8p0i32(<8 x i32> %[[StoreVal]], <8 x i32*> %[[AAddr2]], i32 4, <8 x i1> <i1 true, i1 true, i1 true
				; AVX: %[[InnerPhiNext]] = add nuw nsw <8 x i64> %[[InnerPhi]], <i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1>
				; AVX: %[[VecCond:.*]] = icmp eq <8 x i64> %[[InnerPhiNext]], <i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8>
				; AVX: %[[InnerCond:.*]] = extractelement <8 x i1> %[[VecCond]], i32 0
				; AVX: br i1 %[[InnerCond]], label %[[ForInc]], label %[[InnerLoop]]

				; AVX: [[ForInc]]:
				; AVX: %[[IndNext]] = add i64 %[[Ind]], 8
				; AVX: %[[VecIndNext]] = add <8 x i64> %[[VecInd]], <i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8>
				; AVX: %[[Cmp:.*]] = icmp eq i64 %[[IndNext]], 8
				; AVX: br i1 %[[Cmp]], label %middle.block, label %vector.body

				@arr2 = external global [8 x i32], align 16
				@arr = external global [8 x [8 x i32]], align 16

				; Function Attrs: norecurse nounwind uwtable
				define void @foo(i32 %n) {
				entry:
				br label %for.body

				for.body: ; preds = %for.inc8, %entry
				%indvars.iv21 = phi i64 [ 0, %entry ], [ %indvars.iv.next22, %for.inc8 ]
				%arrayidx = getelementptr inbounds [8 x i32], [8 x i32]* @arr2, i64 0, i64 %indvars.iv21
				%0 = trunc i64 %indvars.iv21 to i32
				store i32 %0, i32* %arrayidx, align 4
				%1 = trunc i64 %indvars.iv21 to i32
				%add = add nsw i32 %1, %n
				br label %for.body3

				for.body3: ; preds = %for.body3, %for.body
				%indvars.iv = phi i64 [ 0, %for.body ], [ %indvars.iv.next, %for.body3 ]
				%arrayidx7 = getelementptr inbounds [8 x [8 x i32]], [8 x [8 x i32]]* @arr, i64 0, i64 %indvars.iv, i64 %indvars.iv21
				store i32 %add, i32* %arrayidx7, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 8
				br i1 %exitcond, label %for.inc8, label %for.body3

				for.inc8: ; preds = %for.body3
				%indvars.iv.next22 = add nuw nsw i64 %indvars.iv21, 1
				%exitcond23 = icmp eq i64 %indvars.iv.next22, 8
				br i1 %exitcond23, label %for.end10, label %for.body, !llvm.loop !1

				for.end10: ; preds = %for.inc8
				ret void
				}

				!1 = distinct !{!1, !2}
				!2 = !{!"llvm.loop.vectorize.enable", i1 true}
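
For readers following the CHECK lines, here is a minimal C++ sketch of what the outer-loop vectorized code computes. It is a lane-by-lane model under stated assumptions, not the vectorizer's output: `fooScalar`, `fooOuterVectorized`, `Row`, and `Matrix` are hypothetical names, and each inner `lane` loop stands in for one vector instruction (the scatters and the vector add with the splat of `n`).

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 8;
using Row = std::array<int, N>;
using Matrix = std::array<Row, N>;

// Scalar reference: the loop nest from the C comment in the test.
void fooScalar(int n, Matrix &arr, Row &arr2) {
  for (std::size_t i1 = 0; i1 < N; ++i1) {
    arr2[i1] = static_cast<int>(i1);
    for (std::size_t i2 = 0; i2 < N; ++i2)
      arr[i2][i1] = static_cast<int>(i1) + n;
  }
}

// Hypothetical model of the outer loop vectorized by VF: each "lane" loop
// stands in for one vector instruction in the checked IR.
template <std::size_t VF>
void fooOuterVectorized(int n, Matrix &arr, Row &arr2) {
  static_assert(N % VF == 0, "sketch assumes VF divides the trip count");
  for (std::size_t i1 = 0; i1 < N; i1 += VF) {  // vector.body advances by VF
    std::array<int, VF> storeVal{};
    for (std::size_t lane = 0; lane < VF; ++lane) {
      arr2[i1 + lane] = static_cast<int>(i1 + lane);     // scatter of the induction vector
      storeVal[lane] = static_cast<int>(i1 + lane) + n;  // vector add with the splat of n
    }
    for (std::size_t i2 = 0; i2 < N; ++i2)               // inner loop is kept as a loop
      for (std::size_t lane = 0; lane < VF; ++lane)
        arr[i2][i1 + lane] = storeVal[lane];             // scatter to arr[i2][i1 + lane]
  }
}
```

Instantiating `fooOuterVectorized<4>` corresponds to the `CHECK` prefix (two trips through `vector.body`), and `fooOuterVectorized<8>` to the `AVX` prefix (a single trip); both should leave `arr` and `arr2` identical to the scalar version.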

llvm/trunk/test/Transforms/LoopVectorize/explicit_outer_detection.ll

; }		; }
; }		; }

; Case 1: Annotated outer loop WITH vector width information must be collected.		; Case 1: Annotated outer loop WITH vector width information must be collected.

; CHECK-LABEL: vector_width		; CHECK-LABEL: vector_width
; CHECK: LV: Loop hints: force=enabled width=4 unroll=0		; CHECK: LV: Loop hints: force=enabled width=4 unroll=0
; CHECK: LV: We can vectorize this outer loop!		; CHECK: LV: We can vectorize this outer loop!
; CHECK: LV: Using user VF 4.		; CHECK: LV: Using user VF 4 to build VPlans.
; CHECK-NOT: LV: Loop hints: force=?		; CHECK-NOT: LV: Loop hints: force=?
; CHECK-NOT: LV: Found a loop: inner.body		; CHECK-NOT: LV: Found a loop: inner.body

target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"		target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

define void @vector_width(i32* nocapture %a, i32* nocapture readonly %b, i32 %N, i32 %M) local_unnamed_addr {		define void @vector_width(i32* nocapture %a, i32* nocapture readonly %b, i32 %N, i32 %M) local_unnamed_addr {
entry:		entry:
%cmp32 = icmp sgt i32 %N, 0		%cmp32 = icmp sgt i32 %N, 0
outer.inc: ; preds = %inner.body, %outer.body		outer.inc: ; preds = %inner.body, %outer.body
%indvars.iv.next36 = add nuw nsw i64 %indvars.iv35, 1		%indvars.iv.next36 = add nuw nsw i64 %indvars.iv35, 1
%exitcond39 = icmp eq i64 %indvars.iv.next36, %wide.trip.count38		%exitcond39 = icmp eq i64 %indvars.iv.next36, %wide.trip.count38
br i1 %exitcond39, label %for.end15, label %outer.body, !llvm.loop !6		br i1 %exitcond39, label %for.end15, label %outer.body, !llvm.loop !6

for.end15: ; preds = %outer.inc, %entry		for.end15: ; preds = %outer.inc, %entry
ret void		ret void
}		}

; Case 2: Annotated outer loop WITHOUT vector width information doesn't have to		; Case 2: Annotated outer loop WITHOUT vector width information must be collected.
; be collected.

; CHECK-LABEL: case2		; CHECK-LABEL: case2
; CHECK-NOT: LV: Loop hints: force=enabled		; CHECK: LV: Loop hints: force=enabled width=0 unroll=0
; CHECK-NOT: LV: We can vectorize this outer loop!		; CHECK: LV: We can vectorize this outer loop!
; CHECK: LV: Loop hints: force=?		; CHECK: LV: Using computed VF 1 to build VPlans.
; CHECK: LV: Found a loop: inner.body

define void @case2(i32* nocapture %a, i32* nocapture readonly %b, i32 %N, i32 %M) local_unnamed_addr {		define void @case2(i32* nocapture %a, i32* nocapture readonly %b, i32 %N, i32 %M) local_unnamed_addr {
entry:		entry:
%cmp32 = icmp sgt i32 %N, 0		%cmp32 = icmp sgt i32 %N, 0
br i1 %cmp32, label %outer.ph, label %for.end15		br i1 %cmp32, label %outer.ph, label %for.end15

outer.ph: ; preds = %entry		outer.ph: ; preds = %entry
%cmp230 = icmp sgt i32 %M, 0		%cmp230 = icmp sgt i32 %M, 0
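
The vector widths checked in these tests are consistent with a simple rule: divide the widest vector register width by the widest memory element width. The sketch below illustrates that rule only; `computeVF` is a hypothetical helper, not a function from the patch, and the register widths in the usage note are assumptions about the targets involved.

```cpp
// Hypothetical helper illustrating the rule; not the patch's actual code.
// Returns how many elements of the widest type fit in one vector register,
// falling back to 1 (scalar) when the register is narrower than the type
// or when the type width is unknown (0).
constexpr unsigned computeVF(unsigned widestRegisterBits,
                             unsigned widestTypeBits) {
  return (widestTypeBits == 0 || widestRegisterBits < widestTypeBits)
             ? 1u
             : widestRegisterBits / widestTypeBits;
}
```

Under this rule, 128-bit registers with `i32` stores give 4 (the `<4 x i32>` checks on baseline x86_64), 256-bit registers give 8 (the `<8 x i32>` checks with `-mattr=+avx`), and a target that reports only a 32-bit register would give the "computed VF 1" seen in Case 2.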

[VPLAN] Determine Vector Width programmatically.
Closed, Public

Diff 192599

llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationPlanner.h

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp

llvm/trunk/test/Transforms/LoopVectorize/AArch64/outer_loop_test1_no_explicit_vect_width.ll

llvm/trunk/test/Transforms/LoopVectorize/X86/outer_loop_test1_no_explicit_vect_width.ll

llvm/trunk/test/Transforms/LoopVectorize/explicit_outer_detection.ll
