This is an archive of the discontinued LLVM Phabricator instance.

[Loop Vectorizer] Interleave vs Gather - in some cases Gather is better.
ClosedPublic

Authored by delena on Dec 19 2016, 7:10 AM.

Details

Reviewers

anemet
Ayal
mkuper
mssimpso

Commits

rG5267edd3e390: [Loop Vectorizer] Cost-based decision for vectorization form of memory…
rL294503: [Loop Vectorizer] Cost-based decision for vectorization form of memory…

Summary

The bug is described in PR31426.
The cost of a Load instruction is calculated in the following order: isConsecutive - isInterleave - isGather - scalar.
When a Load instruction belongs to an Interleave group, the "Gather" option is not checked at all. But when the interleave factor exceeds the maximum, the interleave cost is high and "Gather" is preferred. The following loop is not vectorized on AVX-512 due to this bug:
for (i=0; i<N; ++i)
  B[i] = A[i*5]
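
For reference, the access pattern in question can be written as a small compilable function. This is only a sketch of the loop shape from the summary; the name `copyStrided` and the use of `std::vector<int>` are illustrative assumptions, not code from the patch.

```cpp
#include <cstddef>
#include <vector>

// Copies every fifth element of A into B: B[i] = A[i * 5].
// The load from A has a constant stride of 5, so a vectorizer has to choose
// between an interleaved access, a gather, or scalar loads for it.
// For N > 0 the caller must guarantee A.size() > (N - 1) * 5.
std::vector<int> copyStrided(const std::vector<int> &A, std::size_t N) {
  std::vector<int> B(N);
  for (std::size_t i = 0; i < N; ++i)
    B[i] = A[i * 5];
  return B;
}
```

The store to `B` is consecutive, so only the load from `A` is affected by the interleave-versus-gather choice.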

Diff Detail

Repository: rL LLVM

Event Timeline

delena updated this revision to Diff 81947. Dec 19 2016, 7:10 AM

delena retitled this revision from to [Loop Vectorizer] Interleave vs Gather - in some cases Gather is better..

delena updated this object.

delena added reviewers: mkuper, Ayal, anemet.

delena set the repository for this revision to rL LLVM.

delena added a subscriber: llvm-commits.

Herald added a subscriber: mzolotukhin. Dec 19 2016, 7:10 AM

Added some comments.

mssimpso added a subscriber: mssimpso. Dec 19 2016, 7:44 AM

mssimpso added inline comments. Dec 19 2016, 9:23 AM

../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7047–7049 ↗	(On Diff #81949)	Why don't you just compare the costs? You wouldn't need to make this assumption anymore.

mkuper added inline comments. Dec 19 2016, 10:11 AM

../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7052 ↗	(On Diff #81949)	Aren't we already checking this in selectInterleaveCount()? How do we end up with interleave factors above MaxInterleaveFactor in the first place?

mssimpso added inline comments. Dec 19 2016, 10:25 AM

../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7052 ↗	(On Diff #81949)	Ah, our naming conventions have confused this somewhat, I think. TTI.getMaxInterleaveFactor is the hook for the max unroll factor ("interleaving"). I think Elena wanted TLI.getMaxSupportedInterleaveFactor instead. This is the hook for determining the max factor of the interleaved access groups.

mkuper added inline comments. Dec 19 2016, 10:31 AM

../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7052 ↗	(On Diff #81949)	Argh, right. Sorry for the stupid question, I just looked at this and went "this looks odd" without actually reading the context. But yes, Elena, you want the other function. Regardless, can we maybe change the naming to something sane? As a separate patch, of course. (I don't have any good ideas, though.)

delena added inline comments. Dec 20 2016, 6:04 AM

../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7047–7049 ↗	(On Diff #81949)	The cost that we provide for interleaved access is incorrect, especially for AVX-512. AVX-512 has 3-src shuffles and the real cost is much lower. I can't compare it to Gather: the Gather cost wins today, even for a small stride, which is not right. So there are 2 bugs:
(1) The loop is scalarized and the Gather/Scatter option is not considered at all.
(2) The cost for interleaving is incorrect.
I can start by providing a correct cost for interleaving on AVX-512, or I can fix (1) first. I'll retrieve the proper "MaxSupportedInterleaveFactor". What do you think?

mkuper added inline comments. Dec 20 2016, 11:13 AM

../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7047–7049 ↗	(On Diff #81949)	I think it would be better to fix the cost model first. It's very pessimistic for x86 in general, not just AVX-512, but you're right, it's even worse for AVX-512, because the real cost is lower. But I thought Farhana was already working on that. Am I confused?

Now that the interleaving cost calculation is correct, I compare the cost of the 3 possible options (interleave, gather and scalar) and choose the best one.
The decision made by the cost model should be saved and later used when we generate the vector code.

The code is refactored in order to allow this comparison, but the comparison is the *only* functional change in the code. The rest of the logic remains the same.

I added a test case that demonstrates gather - vs - interleave case.

Merged the patch with the latest changes in LV.

Ping*

Hi Elena,

This patch causes a crash in spec2006/povray on AArch64. I've pasted a test case over at P7951. The problem has to do with the analysis in collectLoopUniforms and the new decision to scalarize. collectLoopUniforms is very conservative about what instructions remain uniform after vectorization. If a memory access has the possibility of being scalarized (even though it may not be), its pointer operand is not marked uniform. What's happening here is that you've introduced a new scalarization decision based on the cost model that collectLoopUniforms doesn't know about (and likely can't know about). In the test case, collectLoopUniforms marks the pointer operands of the loads uniform even though the loads are now scalarized, which causes the pointer operands to be non-uniform.

I'm not really sure what the best way to fix this would be. Any thoughts? One way would be to remove scalarization from your widening decision.

mssimpso added a reviewer: mssimpso. Jan 12 2017, 10:49 AM

mssimpso removed a subscriber: mssimpso.

In D27919#644255, @mssimpso wrote:

Hi Elena,

I'm not really sure what the best way to fix this would be. Any thoughts? One way would be to remove scalarization from your widening decision.

That sounds a bit unfortunate. I think we'd rather like the cost model to be able to say "please scalarize this".

../lib/Transforms/Vectorize/LoopVectorize.cpp
1702 ↗	(On Diff #83607)	I'm not a fan of the name of this enum, but don't really have any good ideas. Anyone else?
1725 ↗	(On Diff #83607)	This looks like a suboptimal way to iterate over InterleaveGroup, because it requires a lookup for every index, instead of just iterating over the Members collection. (We don't care about the order here, right?) Perhaps add a better interface?
1731 ↗	(On Diff #83607)	Is this really a per-instruction thing, or do we just always end up with LV_NONE when the cost model is not in use (e.g. explicit user-provided VF). If the former, when does this happen? If the latter, then I think the comment is a bit misleading - and it would be good to be able to assert on this happening only when there's no cost model. Edit: Oh, I think I see, you care about temporary LV_NONE values for Interleave groups. I still think it'd be nice if we could somehow distinguish between the cases, so we could assert on not having a cost when we should.
2878 ↗	(On Diff #83607)	Extra space after =.
2891 ↗	(On Diff #83607)	assert(!Legal->memoryInstructionMustBeScalarized(Instr, VF)), maybe? Because now we leave the door open for the cost model decision being to widen, even though memoryInstructionMustBeScalarized(). Another option is to replace the if with: if (Decision == LoopVectorizationLegality::LV_SCALARIZE || Legal->memoryInstructionMustBeScalarized(Instr, VF)) But I'm not sure it makes sense, because there's no good reason to always call memoryInstructionMustBeScalarized() in a non-asserts build - in theory, the cost model should have already checked this.
7168 ↗	(On Diff #83607)	Wouldn't you calculate it several times per group if it's unprofitable?
../test/Analysis/CostModel/X86/interleave-load-i32.ll
13 ↗	(On Diff #83607)	What happened to the VF = 1 cost?

In D27919#644255, @mssimpso wrote:

Hi Elena,

This patch causes a crash in spec2006/povray on AArch64. I've pasted a test case over at P7951. The problem has to do with the analysis in collectLoopUniforms and the new decision to scalarize. collectLoopUniforms is very conservative about what instructions remain uniform after vectorization. If a memory access has the possibility of being scalarized (even though it may not be), its pointer operand is not marked uniform. What's happening here is that you've introduced a new scalarization decision based on the cost model that collectLoopUniforms doesn't know about (and likely can't know about). In the test case, collectLoopUniforms marks the pointer operands of the loads uniform even though the loads are now scalarized, which causes the pointer operands to be non-uniform.

I'm not really sure what the best way to fix this would be. Any thoughts? One way would be to remove scalarization from your widening decision.

I'm trying to revisit collections of Uniforms and Scalars after cost estimation and remove GEPs and induction variables if we decided to scalarize. Not finished yet..

../lib/Transforms/Vectorize/LoopVectorize.cpp
1702 ↗	(On Diff #83607)	CM_DECISION_NONE, CM_DECISION_WIDEN .. ?
1725 ↗	(On Diff #83607)	I agree, but iterating up to Factor is not a big overhead and it is used in at least one more place. This patch is complex enough; I'd postpone unrelated changes to another patch.
2891 ↗	(On Diff #83607)	The form I used just prevents a redundant call to memoryInstructionMustBeScalarized(). Because we may have a decision (INTERLEAVE, WIDEN or GATHER_SCATTER) at this stage.
7168 ↗	(On Diff #83607)	LV_NONE means "not calculated yet". If we are here, each memory instruction will have "a decision".
../test/Analysis/CostModel/X86/interleave-load-i32.ll
13 ↗	(On Diff #83607)	There is no group for VF 1. Instruction cost is still there.

In D27919#647975, @delena wrote:

I'm trying to revisit collections of Uniforms and Scalars after cost estimation and remove GEPs and induction variables if we decided to scalarize. Not finished yet..

That sounds like a good idea to me, and I hope we can separate the uniform/scalar collection from the cost estimation in a way that makes sense. When I looked at doing that a while back, the complication I ran into was that the cost estimates depend on knowing the uniforms. But if we use the cost estimates to help determine the uniforms, then we end up with a weird circular dependence. If we mark something uniform/scalar after computing the costs, we will need to update the costs somehow. That in turn might change what we would like to scalarize, etc.

I'm revisiting the set of Scalars and Uniforms after cost modeling. Taking into account that the cost model uses Uniforms and the Uniforms are changed after the cost modeling, I still think that there is no circular dependency here.
(1) I remove uniform GEP (and the corresponding induction) when we decide to scalarize memory instruction.
(2) I do not remove GEP from scalars if it will not be used in Gather/Scatter.

I added a test that Matthew sent me.

mssimpso added inline comments. Jan 18 2017, 9:24 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009 ↗	(On Diff #84811)	Hi Elena, I had been thinking about the use of isUniformAfterVectorization() here in getInstructionCost(). Wouldn't it now be possible for the set of uniforms to differ from the first collection (before VF selection) and the second collection (after VF selection)? So we would choose a VF based on costs assuming an instruction may or may not be uniform. Then we could later reverse our initial decision about the instruction's uniformity after VF selection, making the total cost on which we based our VF decision inaccurate. Or am I missing something? I haven't yet thought through the implications of this in enough detail to know whether this would matter much or not.

delena added inline comments. Jan 19 2017, 4:40 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009 ↗	(On Diff #84811)	About the list of Uniforms. We insert and then remove only GEPs and Induction variables. We do not calculate cost for them anyway. All other Uniform values stay in place. So, the cost is accurate at the end. There is no circular dependency here.

mssimpso added inline comments. Jan 19 2017, 5:18 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009 ↗	(On Diff #84811)	I don't think this is true in general. We mark an instruction uniform if all its users are uniform. So for example, if we have a uniform GEP whose index is some computation, that computation is also uniform if it's only used by the GEP. I think we have some examples in induction.ll, but something like this:
    %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
    %sum = add i64 %i, %x
    %idx = getelementptr inbounds float, float* %a, i64 %sum
    load float, float* %idx, align 4
The GEP is consecutive, so it will be marked uniform. %sum will also be marked uniform because it's only used by the GEP. If we later decide to scalarize the load, the GEP, the IV, and %sum will all no longer be uniform. So the cost for %sum will have been wrong.
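
The propagation rule described in this comment ("an instruction is uniform if all its users are uniform") can be illustrated with a toy fixed-point computation. This is a minimal sketch, not the logic of collectLoopUniforms: values are modeled as plain strings, the def-use graph is a hypothetical `UserMap`, and the function name `propagateUniforms` is made up for the example.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy def-use graph: maps each value name to the names of its users.
using UserMap = std::map<std::string, std::vector<std::string>>;

// Starting from Seeds, repeatedly adds any value whose users are all already
// in the set, until nothing changes. Values without users are never added.
std::set<std::string> propagateUniforms(const UserMap &Users,
                                        std::set<std::string> Seeds) {
  bool Changed = true;
  while (Changed) {
    Changed = false;
    for (const auto &Entry : Users) {
      if (Seeds.count(Entry.first) || Entry.second.empty())
        continue;
      bool AllUsersUniform = true;
      for (const std::string &U : Entry.second)
        AllUsersUniform = AllUsersUniform && Seeds.count(U) != 0;
      if (AllUsersUniform) {
        Seeds.insert(Entry.first);
        Changed = true;
      }
    }
  }
  return Seeds;
}
```

With the graph `sum -> idx -> load`, seeding the set with `idx` (the load is widened) also pulls in `sum`; with no seed (the load is scalarized) nothing is uniform. That difference is why a cost computed for `%sum` under the first assumption can become stale.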

mssimpso added inline comments. Jan 19 2017, 6:19 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009 ↗	(On Diff #84811)	Just a thought - why not recompute and cache the uniforms (and possibly scalars) for each VF we compute costs for? That would avoid any potential logical inconsistencies. I think the compile-time overhead would probably be minimal (and you're already computing these sets twice anyway).

delena added inline comments. Jan 19 2017, 7:18 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009 ↗	(On Diff #84811)	Just talked with Ayal about this. I can collect Uniforms after making the decision about Load/Store instructions, and the decision is based on cost. The decision affects other instructions inside the loop, as you've pointed out before. Theoretically, if I have N variants of representing all memory instructions inside the loop, I should examine 2**N combinations per VF. Ayal proposed the following sequence, which should be done at the CM stage, after legality is finished. Per VF:
(1) Go through all memory insts and make a CM decision.
(2) Build Uniforms and Scalars per VF (that's what you say now).
(3) Calculate the cost for the VF, based on Uniforms and Scalars.
It is still not ideal, but probably better than what we have.
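
The per-VF sequence proposed here can be sketched as a three-step loop. Everything in this example is an assumption made for illustration: instructions are integer ids, the `Hooks` callbacks stand in for the real decision, uniform-collection and cost analyses, and none of the names come from the patch.

```cpp
#include <functional>
#include <map>
#include <set>
#include <vector>

enum class Widening { Widen, Interleave, GatherScatter, Scalarize };

// Everything the cost model records for one vectorization factor.
struct PerVFState {
  std::map<int, Widening> Decisions; // memory instruction id -> decision
  std::set<int> Uniforms;            // ids treated as uniform for this VF
  unsigned Cost = 0;
};

// Hypothetical hooks standing in for the real analyses.
struct Hooks {
  std::function<Widening(int Inst, unsigned VF)> Decide;
  std::function<std::set<int>(const std::map<int, Widening> &)> CollectUniforms;
  std::function<unsigned(const PerVFState &, unsigned VF)> ComputeCost;
};

std::map<unsigned, PerVFState> analyzeAllVFs(const std::vector<int> &MemInsts,
                                             const std::vector<unsigned> &VFs,
                                             const Hooks &H) {
  std::map<unsigned, PerVFState> Result;
  for (unsigned VF : VFs) {
    PerVFState &S = Result[VF];
    for (int I : MemInsts) // Step 1: decide per memory instruction.
      S.Decisions[I] = H.Decide(I, VF);
    S.Uniforms = H.CollectUniforms(S.Decisions); // Step 2: depends on step 1.
    S.Cost = H.ComputeCost(S, VF);               // Step 3: depends on step 2.
  }
  return Result;
}
```

Because each step only reads the results of the previous one for the same VF, the ordering avoids the circular dependence discussed earlier in the thread, at the price of making each memory decision independently rather than exploring all combinations.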

mssimpso added inline comments. Jan 19 2017, 7:44 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009 ↗	(On Diff #84811)	Ayal's sequence makes sense to me. We should probably also try to move Uniforms/Scalars and related functions over to LoopVectorizationCostModel rather than keep them in LoopVectorizationLegality, as they will now be a function of Cost/VF and would be more appropriate there in my opinion. This could probably be a separate patch.

I moved Uniforms and Scalars from Legality to the Cost Model. Now we collect Uniforms and Scalars per VF and these collections depend on widening decisions that CM takes for Load/Store instructions.

The patch is big, but I did not see how to split it into separate patches.

delena added a subscriber: dorit. Jan 26 2017, 12:21 AM

mkuper mentioned this in D28975: [LV] Introducing VPlan to model the vectorized code and drive its transformation. Jan 26 2017, 4:50 PM

Ping *
We are in sync with Ayal and Gil working on VPlan.

Hi Elena,

I'll take a look at this again today. Thanks for the reminder!

Hi Elena,

Thanks for your patience. I haven't yet looked in detail at the widening decision selection, but here are some comments around the uniforms/scalars.

Matt.

../lib/Transforms/Vectorize/LoopVectorize.cpp
5616–5619 ↗	(On Diff #85764)	This is not true now, right? An interleaved access may be scalarized based on the cost model.
5625–5626 ↗	(On Diff #85764)	Can we change this to something like: if (Uniforms.count(VF)) return; auto &UniformsVF = Uniforms[VF]; We want to distinguish the case that (1) Uniforms have not been computed for VF from (2) Uniforms have been computed for VF but there aren't any, so we don't need to compute them again. We can end up calling this twice for the same VF if we have a user-selected VF and then compute the expected cost for interleaving. This is similar to the way we do this check in collectInstsToScalarize. This will also apply to collectLoopScalars.
5687 ↗	(On Diff #85764)	Why not roll this check into memoryInstructionMustBeScalarized? Either way, I think this check and the check below in isVectorizedMemAccessUse that calls memoryInstructionMustBeScalarized, should be the same.
7007 ↗	(On Diff #85764)	The VF > 1 check is not needed because you check that condition in isUniformAfterVectorization.
7022 ↗	(On Diff #85764)	This should be VF == 1, since we can't have a zero VF.
7466–7479 ↗	(On Diff #85764)	Can we just delete this in favor of a helper function that checks VecValuesToIgnore and IsScalarAfterVectorization for a given VF? Something like: bool LoopVectorizationCostModel::shouldIgnoreVecValueInCostModel(Instruction *I, unsigned VF) { return VecValuesToIgnore.count(I) || isScalarAfterVectorization(I, VF); } This way we won't have to be imprecise.
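
The check suggested above for lines 5625–5626 separates "not computed for this VF" from "computed, but empty". A minimal sketch of that idea follows; `UniformsCache`, the integer instruction ids, and passing the analysis result in as a parameter are all assumptions made to keep the example self-contained.

```cpp
#include <map>
#include <set>

class UniformsCache {
public:
  // Records the (stand-in) analysis result for VF at most once.
  // Returns true if the result was stored, false if VF was already analyzed.
  bool collect(unsigned VF, const std::set<int> &AnalysisResult) {
    if (Uniforms.count(VF))
      return false; // Already computed, possibly with an empty result.
    Uniforms[VF] = AnalysisResult; // An empty set still creates the entry.
    return true;
  }

  bool isComputed(unsigned VF) const { return Uniforms.count(VF) != 0; }

private:
  std::map<unsigned, std::set<int>> Uniforms;
};
```

Testing `Uniforms[VF].empty()` instead would rerun the analysis every time a VF legitimately has no uniforms, which is the case the key-presence check avoids.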

delena marked 3 inline comments as done. Jan 31 2017, 7:22 AM

delena added inline comments.

../lib/Transforms/Vectorize/LoopVectorize.cpp
5616–5619 ↗	(On Diff #85764)	It is still true. An instruction must be scalarized if there is no other option. This is a "legality" check; the cost model comes later.
5625–5626 ↗	(On Diff #85764)	Yes. I've changed.
5687 ↗	(On Diff #85764)	Yes, I don't need to check legality any more. It is already implied in CM decision.
7007 ↗	(On Diff #85764)	Yes. You are right, thanks!
7466–7479 ↗	(On Diff #85764)	Yes, it is possible.

Updated, following Matthew's comments.

Hi Elena,

Here are some more inline comments. Also, please clang-format the patch if you haven't already done so to make the review easier. Thanks!

../lib/Transforms/Vectorize/LoopVectorize.cpp
1930 ↗	(On Diff #86431)	How are we using CM_DECISION_NONE? Aren't we forced to make some sort of decision? I would think the default would be CM_DECISION_WIDEN unless the cost model points to one of the others.
2054–2055 ↗	(On Diff #86431)	The VF > 2 comment can be removed now.
2062–2063 ↗	(On Diff #86431)	The VF > 2 comment can be removed now.
5503–5504 ↗	(On Diff #86431)	Since you've added a new check in collectUniformsAndScalars to ensure the analysis is performed only once, the check here and in collectLoopUniforms is redundant. Should these be asserts now?
5556–5563 ↗	(On Diff #86431)	I don't think I see a real use for this function anymore. Please see my related comment about memoryAccessMustBeScalarized. This function was only ever used in collectLoopUniforms to help determine how a memory access would be vectorized. I think you can probably greatly simplify the logic in collectLoopUniforms and remove it. Now, we know what the vectorizer will do based on the cost model decision. For a GEP to remain uniform, I think we just need to know that all its users are CM_DECISION_INTERLEAVE or CM_DECISION_WIDEN. Is this right? If so, please work it into the GEP part of collectLoopUniforms.
5616–5619 ↗	(On Diff #85764)	I think the cost vs legality distinction is not important here. My original intent with this function was just to consolidate all the scalarization conditions for a given access. That way it could be called when collecting uniforms, computing costs, and vectorizing and they all would agree on what would happen. Perhaps the choice of name was misleading? In any case, unless I'm missing something, I don't think you actually use this function anymore. You've replaced all uses with getWideningDecision(I, VF) == CM_DECISION_SCALARIZE This makes sense because I think you've moved all the non-cost related scalarization decisions into the inverse: memoryInstructionCanBeWidened. Can this function be deleted now?

In D27919#662246, @mssimpso wrote:

Hi Elena,

Here are some more inline comments. Also, please clang-format the patch if you haven't already done so to make the review easier. Thanks!

Done.

../lib/Transforms/Vectorize/LoopVectorize.cpp
1930 ↗	(On Diff #86431)	I used it while capturing decisions for an interleave group. At this point I can get rid of it. I return "NONE" if there is no decision, and it is convenient:
    InstWidening getWideningDecision(Instruction *I, unsigned VF) {
      assert(VF >= 2 && "Expected VF >=2");
      std::pair<Instruction *, unsigned> InstOnVF = std::make_pair(I, VF);
      if (!WideningDecisions.count(InstOnVF))
        return CM_DECISION_NONE;
      return WideningDecisions[InstOnVF];
    }
Then I use it in an assertion, to make sure that a decision has been made.
5503–5504 ↗	(On Diff #86431)	Maybe, until somebody decides to call these functions separately. Right now I don't see a reason. I'll put an "assert" and add a comment.
5556–5563 ↗	(On Diff #86431)	I removed mustBeScalarized(). (I suppose you meant this function, the lines are mixed up)

delena updated this revision to Diff 86636. Feb 1 2017, 7:44 AM

mssimpso added inline comments. Feb 1 2017, 7:54 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
5556–5563 ↗	(On Diff #86431)	In this comment I was talking about hasConsecutiveLikePtrOperand. I don't think it's needed anymore and can probably be deleted. It's only used in collectLoopUniforms to help determine if a GEP will remain uniform. But can't we now determine that much easier by just checking the CM_DECISION of the accesses? Does that make sense?

I removed hasConsecutiveLikePtrOperand and simplified the code.

Hi Elena,

I like the direction this patch is going. Thanks for all your work. Here are some more comments inline.

../lib/Transforms/Vectorize/LoopVectorize.cpp
7053 ↗	(On Diff #86780)	Can we avoid the divisions here and below and use multiplication instead? We won't have any round-off issues that way. What about something like: unsigned Accesses = 1; if (Legal->isAccessInterleaved(&I)) { Accesses = Group->getNumMembers(); ... } ... ScalarizationCost *= Accesses; ...
7072–7080 ↗	(On Diff #86780)	I think we're fairly consistent in other places where we compare costs, that we prefer the scalar version if there's no benefit for vectorization. So if the scalarization cost is <= the other costs, I think we should go with that.
7255–7284 ↗	(On Diff #86780)	Can't this all be replaced by a call to getInterleaveGroupCost()?
7291–7314 ↗	(On Diff #86780)	Can't this all be replaced by a call to getMemInstScalarizationCost()?
7323–7327 ↗	(On Diff #86780)	Similar comment to the above. I don't think you have a helper for computing gather/scatter cost, but I think it would be nice. It would be easier to keep getInstructionCost in sync with setCostBasedWideningDecision.
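
The point made above for line 7053, comparing costs by multiplication rather than division, can be shown with two small predicates. This is a sketch under assumed names (`groupNoWorseByDivision`, `groupNoWorseByMultiplication`); the actual comparison in the patch may be shaped differently.

```cpp
// Compares the per-access share of a group cost against the cost of a single
// access using integer division. Truncation can hide a real difference.
// NumAccesses must be non-zero.
bool groupNoWorseByDivision(unsigned GroupCost, unsigned NumAccesses,
                            unsigned PerAccessCost) {
  return GroupCost / NumAccesses <= PerAccessCost;
}

// Same comparison with the single-access cost scaled up instead, so no
// rounding is involved. Assumes the product does not overflow, which is
// reasonable for small cost-model values.
bool groupNoWorseByMultiplication(unsigned GroupCost, unsigned NumAccesses,
                                  unsigned PerAccessCost) {
  return GroupCost <= PerAccessCost * NumAccesses;
}
```

For a group cost of 7 over 4 accesses and a per-access cost of 1, the division form evaluates `7 / 4 == 1` and reports the group as no worse, while the multiplication form compares 7 against 4 and correctly reports it as more expensive.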

delena marked an inline comment as done. Feb 6 2017, 4:44 AM

delena added inline comments.

../lib/Transforms/Vectorize/LoopVectorize.cpp
7072–7080 ↗	(On Diff #86780)	Agree. I've changed the comparison.
7255–7284 ↗	(On Diff #86780)	I reached the conclusion that we don't need to calculate the cost again. We can keep it together with widening decision.

The memory instruction cost that was calculated during widening decision is saved in WideningDecisions map in order to avoid recalculation.
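
A rough sketch of what keeping the cost together with the decision could look like is shown below. It is not the patch's data structure: instructions are integer ids, `std::map` stands in for whatever container the vectorizer uses, and the enumerator names only loosely follow the ones discussed in this review.

```cpp
#include <cassert>
#include <map>
#include <utility>

enum InstWidening {
  CM_Unknown,
  CM_Widen,
  CM_Interleave,
  CM_GatherScatter,
  CM_Scalarize
};

class WideningCache {
public:
  // Stores the decision and the cost that justified it for (InstId, VF).
  void setWideningDecision(int InstId, unsigned VF, InstWidening W,
                           unsigned Cost) {
    assert(VF >= 2 && "Expected VF >= 2");
    Decisions[std::make_pair(InstId, VF)] = std::make_pair(W, Cost);
  }

  // Returns CM_Unknown when no decision has been recorded yet.
  InstWidening getWideningDecision(int InstId, unsigned VF) const {
    auto It = Decisions.find(std::make_pair(InstId, VF));
    return It == Decisions.end() ? CM_Unknown : It->second.first;
  }

  // The cost must already have been recorded; asking earlier is a bug.
  unsigned getWideningCost(int InstId, unsigned VF) const {
    auto It = Decisions.find(std::make_pair(InstId, VF));
    assert(It != Decisions.end() && "No cost recorded for this instruction");
    return It->second.second;
  }

private:
  std::map<std::pair<int, unsigned>, std::pair<InstWidening, unsigned>>
      Decisions;
};
```

Keying on the (instruction, VF) pair lets the same instruction carry different decisions for different vectorization factors, and reading the cost back avoids a second computation that could drift out of sync with the decision.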

Hi Elena,

I have one comment about getWideningCost; otherwise, the patch looks fine to me now. But please let Michael have one more pass at review since he's been quiet for a while. Thanks!

../lib/Transforms/Vectorize/LoopVectorize.cpp
1995–1996 ↗	(On Diff #87215)	This seems weird to me. All instructions are supposed to have a cost computed by the cost model. I would much rather us assert that I is in WideningDecisions. Isn't it true that if I is a load or store, we would have already computed a cost and saved it in WideningDecisions?
7099 ↗	(On Diff #87215)	Just assert inside getWideningCost() that "I" is present in the mapping and return the cost. The int to unsigned max conversion is unnecessary.

Updated according to Matthew's recent comments.
Matthew, thanks a lot for your review.

Michael, could you, please, take a look?

The high-level structure of this looks good to me, thanks Elena!

Some minor/style comments inline.

../lib/Transforms/Vectorize/LoopVectorize.cpp
330 ↗	(On Diff #87281)	We probably already have several clones of those functions around the code-base. And they are probably all slightly different. LSR has getAccessType(), LAA has getAddressSpaceOperand(), LoadStoreVectorizer has getPointerAddressSpace(), and I'm sure there are more. I don't want to make merging them a precondition of this patch, but can you please at least add a FIXME here?
1929 ↗	(On Diff #87281)	Why not just "return Uniforms[VF]->second.count(I);"? I don't think the verbosity helps here, and we don't actually care about the difference between find() and operator[] due to the assert. Or, to save a lookup in an asserts build, you could find() and then assert on the result of the find().
1937 ↗	(On Diff #87281)	Same as above.
1988 ↗	(On Diff #87281)	Here, on the other hand, I think find() would be better - that way you don't need two lookups.
2120 ↗	(On Diff #87281)	I'm still not sure I understand why this gets called twice for a user-provided VF. Could you explain again?
5512 ↗	(On Diff #87281)	Do we still want this check here? I mean: An instruction with a uniform pointer can be widened. That ties in with the way this is used - we check for the uniform case before the widening case, as we should. So, IIUC, we should never actually hit this.
7022 ↗	(On Diff #87281)	UINT_MAX
7036 ↗	(On Diff #87281)	Why the NumAccesses * 2 cut-off?
7047 ↗	(On Diff #87281)	UINT_MAX
7060 ↗	(On Diff #87281)	Why do you need the "GatherScatterCost < InterleaveCost" check here?
1930 ↗	(On Diff #86431)	Maybe rename NONE to UNKNOWN, then? But I'm fine with None if you think that's clearer. Also, I looked up the naming convention for enums in the coding standard, and I think it should be something like "CM_Widen, CM_Interleave", etc, not all caps.

delena marked 5 inline comments as done. Feb 7 2017, 12:48 AM

delena added inline comments.

../lib/Transforms/Vectorize/LoopVectorize.cpp
1929 ↗	(On Diff #87281)	The "const" qualifier does not allow the Uniforms[VF] form.
2120 ↗	(On Diff #87281)	We call this function from multiple places. From calculateRegisterUsage(), selectVectorizationFactor() - user-defined-VF and from expectedCost(). I just prevent recalculation.
7036 ↗	(On Diff #87281)	I consider a cost per instruction. In this case InterleaveCost / NumAccesses == 1. (Matthew asked to avoid divisions). About 1 inst per access is good enough. I added more comments.
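
The `const` point raised for line 1929 is a general C++ one: `operator[]` on a map may insert, so it cannot be called through a const reference or in a const member function. A minimal sketch with `std::map` as a stand-in container (an assumption; the class name `UniformInfo` and the integer ids are illustrative):

```cpp
#include <cassert>
#include <map>
#include <set>

class UniformInfo {
public:
  void add(unsigned VF, int InstId) { Uniforms[VF].insert(InstId); }

  // std::map::operator[] is non-const because it may insert a new entry,
  // so a const member has to use find() instead.
  bool isUniformAfterVectorization(int InstId, unsigned VF) const {
    auto It = Uniforms.find(VF);
    assert(It != Uniforms.end() && "VF was not analyzed");
    return It->second.count(InstId) != 0;
  }

private:
  std::map<unsigned, std::set<int>> Uniforms;
};
```

Using `find()` and asserting on its result also costs a single lookup, which matches the alternative the reviewer offered.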

Some code improvements, addressed Michael's comments.

mkuper added inline comments. Feb 7 2017, 9:27 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7069 ↗	(On Diff #87371)	So, in case the costs are equal, you prefer scalarization to interleaving, and interleaving to scatter/gather? (In theory it shouldn't matter what happens when the costs are equal, just making sure I understand this.)
1929 ↗	(On Diff #87281)	Ohh, right, missed it's const, sorry.
2120 ↗	(On Diff #87281)	I understand, I'm just wondering whether we really need to do that. Anyway, it's not a new problem, we don't have to solve it here.
5512 ↗	(On Diff #87281)	I think you may have missed this comment.
7036 ↗	(On Diff #87281)	Well, this isn't really "about 1", this is "below 2". I'd be more conservative here (InterleaveCost <= NumAccesses? Can this even happen? Or are you trying to catch the cases where the ratio is ~1.1-1.2?). Or maybe remove this altogether. Is getGatherScatterCost() expensive in terms of compile time?

mssimpso added inline comments. Feb 7 2017, 9:36 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7036 ↗	(On Diff #87281)	The TTI estimates are supposed to be cheap to compute. I think it makes sense to remove this altogether in favor of greater simplicity.

delena added inline comments. Feb 8 2017, 1:02 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7069 ↗	(On Diff #87371)	Matthew asked to change: "I think we're fairly consistent in other places where we compare costs, that we prefer the scalar version if there's no benefit for vectorization. So if the scalarization cost is <= the other costs, I think we should go with that."
2120 ↗	(On Diff #87281)	In the current version, before my changes, we calculate Uniforms and Scalars once and do this in Legality. In this patch, I moved Uniforms and Scalars from Legality to the Cost Model and calculate them per VF. I did not find a single place to put the call, I'm calling the collectUniformsAndScalars() from multiple places. Doing that, I want to prevent the data recalculation. I suppose that finding a right single place for calling collectUniformsAndScalars() is possible, but it will require additional movements in selectVectorizationFactor(). I think it can be done in a separate patch.
5512 ↗	(On Diff #87281)	I'll fix.
7036 ↗	(On Diff #87281)	ok.
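
The tie-breaking order confirmed for line 7069 (scalarization over interleaving, interleaving over gather/scatter when costs are equal) can be written down as a small selection function. This is a sketch under assumptions: the function and enumerator names are made up, and using the maximum `unsigned` value to mark an unavailable option is borrowed from the UINT_MAX remarks earlier in the review rather than from the committed code.

```cpp
#include <limits>

enum class Choice { Interleave, GatherScatter, Scalarize };

// Picks the cheapest option. On ties, scalarization wins over interleaving,
// and interleaving wins over gather/scatter. Pass
// std::numeric_limits<unsigned>::max() for an option that is not available.
Choice chooseWidening(unsigned InterleaveCost, unsigned GatherScatterCost,
                      unsigned ScalarizationCost) {
  if (ScalarizationCost <= InterleaveCost &&
      ScalarizationCost <= GatherScatterCost)
    return Choice::Scalarize;
  if (InterleaveCost <= GatherScatterCost)
    return Choice::Interleave;
  return Choice::GatherScatter;
}
```

Writing the comparisons with `<=` in this order makes the preference explicit, which is what the reviewer asked to have documented in a comment.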

Some fixes following Michael's recent comments.

LGTM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7069 ↗	(On Diff #87371)	Ah, ok, missed that. Could you please add this as an explicit comment?
2120 ↗	(On Diff #87281)	Ah, ok, got it. Sure, sounds good as a follow-up.

This revision is now accepted and ready to land. Feb 8 2017, 10:21 AM

Closed by commit rL294503: [Loop Vectorizer] Cost-based decision for vectorization form of memory… (authored by delena). Feb 8 2017, 11:37 AM

This revision was automatically updated to reflect the committed changes.

lebedev.ri mentioned this in D111460: [X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`. Oct 9 2021, 1:05 PM

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp	754 lines
llvm/trunk/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll	38 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/consecutive-ptr-uniforms.ll	3 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/gather-vs-interleave.ll	41 lines

Diff 87683

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp

(311 lines not shown)	if (isa<BitCastInst>(Ptr) &&
Type *Pointee2Ty = cast<PointerType>(GEPTy)->getPointerElementType();		Type *Pointee2Ty = cast<PointerType>(GEPTy)->getPointerElementType();
const DataLayout &DL = cast<BitCastInst>(Ptr)->getModule()->getDataLayout();		const DataLayout &DL = cast<BitCastInst>(Ptr)->getModule()->getDataLayout();
if (DL.getTypeSizeInBits(Pointee1Ty) == DL.getTypeSizeInBits(Pointee2Ty))		if (DL.getTypeSizeInBits(Pointee1Ty) == DL.getTypeSizeInBits(Pointee2Ty))
return cast<GetElementPtrInst>(cast<BitCastInst>(Ptr)->getOperand(0));		return cast<GetElementPtrInst>(cast<BitCastInst>(Ptr)->getOperand(0));
}		}
return nullptr;		return nullptr;
}		}

		// FIXME: The following helper functions have multiple implementations
		// in the project. They can be effectively organized in a common Load/Store
		// utilities unit.

/// A helper function that returns the pointer operand of a load or store		/// A helper function that returns the pointer operand of a load or store
/// instruction.		/// instruction.
static Value *getPointerOperand(Value *I) {		static Value *getPointerOperand(Value *I) {
if (auto *LI = dyn_cast<LoadInst>(I))		if (auto *LI = dyn_cast<LoadInst>(I))
return LI->getPointerOperand();		return LI->getPointerOperand();
if (auto *SI = dyn_cast<StoreInst>(I))		if (auto *SI = dyn_cast<StoreInst>(I))
return SI->getPointerOperand();		return SI->getPointerOperand();
return nullptr;		return nullptr;
}		}

		/// A helper function that returns the type of loaded or stored value.
		static Type *getMemInstValueType(Value *I) {
		assert((isa<LoadInst>(I) || isa<StoreInst>(I)) &&
		"Expected Load or Store instruction");
		if (auto *LI = dyn_cast<LoadInst>(I))
		return LI->getType();
		return cast<StoreInst>(I)->getValueOperand()->getType();
		}

		/// A helper function that returns the alignment of load or store instruction.
		static unsigned getMemInstAlignment(Value *I) {
		assert((isa<LoadInst>(I) || isa<StoreInst>(I)) &&
		"Expected Load or Store instruction");
		if (auto *LI = dyn_cast<LoadInst>(I))
		return LI->getAlignment();
		return cast<StoreInst>(I)->getAlignment();
		}

		/// A helper function that returns the address space of the pointer operand of
		/// load or store instruction.
		static unsigned getMemInstAddressSpace(Value *I) {
		assert((isa<LoadInst>(I) || isa<StoreInst>(I)) &&
		"Expected Load or Store instruction");
		if (auto *LI = dyn_cast<LoadInst>(I))
		return LI->getPointerAddressSpace();
		return cast<StoreInst>(I)->getPointerAddressSpace();
		}

/// A helper function that returns true if the given type is irregular. The		/// A helper function that returns true if the given type is irregular. The
/// type is irregular if its allocated size doesn't equal the store size of an		/// type is irregular if its allocated size doesn't equal the store size of an
/// element of the corresponding vector type at the given vectorization factor.		/// element of the corresponding vector type at the given vectorization factor.
static bool hasIrregularType(Type *Ty, const DataLayout &DL, unsigned VF) {		static bool hasIrregularType(Type *Ty, const DataLayout &DL, unsigned VF) {

// Determine if an array of VF elements of type Ty is "bitcast compatible"		// Determine if an array of VF elements of type Ty is "bitcast compatible"
// with a <VF x Ty> vector.		// with a <VF x Ty> vector.
if (VF > 1) {		if (VF > 1) {
[1,269 lines not shown]	public:
/// 0 - Stride is unknown or non-consecutive.		/// 0 - Stride is unknown or non-consecutive.
/// 1 - Address is consecutive.		/// 1 - Address is consecutive.
/// -1 - Address is consecutive, and decreasing.		/// -1 - Address is consecutive, and decreasing.
int isConsecutivePtr(Value *Ptr);		int isConsecutivePtr(Value *Ptr);

/// Returns true if the value V is uniform within the loop.		/// Returns true if the value V is uniform within the loop.
bool isUniform(Value *V);		bool isUniform(Value *V);

/// Returns true if \p I is known to be uniform after vectorization.
bool isUniformAfterVectorization(Instruction *I) { return Uniforms.count(I); }

/// Returns true if \p I is known to be scalar after vectorization.
bool isScalarAfterVectorization(Instruction *I) { return Scalars.count(I); }

/// Returns the information that we collected about runtime memory check.		/// Returns the information that we collected about runtime memory check.
const RuntimePointerChecking *getRuntimePointerChecking() const {		const RuntimePointerChecking *getRuntimePointerChecking() const {
return LAI->getRuntimePointerChecking();		return LAI->getRuntimePointerChecking();
}		}

const LoopAccessInfo *getLAI() const { return LAI; }		const LoopAccessInfo *getLAI() const { return LAI; }

/// \brief Check if \p Instr belongs to any interleaved access group.		/// \brief Check if \p Instr belongs to any interleaved access group.
[60 lines not shown]	public:
unsigned getNumLoads() const { return LAI->getNumLoads(); }		unsigned getNumLoads() const { return LAI->getNumLoads(); }
unsigned getNumPredStores() const { return NumPredStores; }		unsigned getNumPredStores() const { return NumPredStores; }

/// Returns true if \p I is an instruction that will be scalarized with		/// Returns true if \p I is an instruction that will be scalarized with
/// predication. Such instructions include conditional stores and		/// predication. Such instructions include conditional stores and
/// instructions that may divide by zero.		/// instructions that may divide by zero.
bool isScalarWithPredication(Instruction *I);		bool isScalarWithPredication(Instruction *I);

/// Returns true if \p I is a memory instruction that has a consecutive or		/// Returns true if \p I is a memory instruction with consecutive memory
/// consecutive-like pointer operand. Consecutive-like pointers are pointers		/// access that can be widened.
/// that are treated like consecutive pointers during vectorization. The		bool memoryInstructionCanBeWidened(Instruction *I, unsigned VF = 1);
/// pointer operands of interleaved accesses are an example.
bool hasConsecutiveLikePtrOperand(Instruction *I);

/// Returns true if \p I is a memory instruction that must be scalarized
/// during vectorization.
bool memoryInstructionMustBeScalarized(Instruction *I, unsigned VF = 1);

private:		private:
/// Check if a single basic block loop is vectorizable.		/// Check if a single basic block loop is vectorizable.
/// At this point we know that this is a loop with a constant trip count		/// At this point we know that this is a loop with a constant trip count
/// and we only need to check individual instructions.		/// and we only need to check individual instructions.
bool canVectorizeInstrs();		bool canVectorizeInstrs();

/// When we vectorize loops we may change the order in which		/// When we vectorize loops we may change the order in which
/// we read and write from memory. This method checks if it is		/// we read and write from memory. This method checks if it is
/// legal to vectorize the code, considering only memory constraints.		/// legal to vectorize the code, considering only memory constraints.
/// Returns true if the loop is vectorizable		/// Returns true if the loop is vectorizable
bool canVectorizeMemory();		bool canVectorizeMemory();

/// Return true if we can vectorize this loop using the IF-conversion		/// Return true if we can vectorize this loop using the IF-conversion
/// transformation.		/// transformation.
bool canVectorizeWithIfConvert();		bool canVectorizeWithIfConvert();

/// Collect the instructions that are uniform after vectorization. An
/// instruction is uniform if we represent it with a single scalar value in
/// the vectorized loop corresponding to each vector iteration. Examples of
/// uniform instructions include pointer operands of consecutive or
/// interleaved memory accesses. Note that although uniformity implies an
/// instruction will be scalar, the reverse is not true. In general, a
/// scalarized instruction will be represented by VF scalar values in the
/// vectorized loop, each corresponding to an iteration of the original
/// scalar loop.
void collectLoopUniforms();

/// Collect the instructions that are scalar after vectorization. An
/// instruction is scalar if it is known to be uniform or will be scalarized
/// during vectorization. Non-uniform scalarized instructions will be
/// represented by VF values in the vectorized loop, each corresponding to an
/// iteration of the original scalar loop.
void collectLoopScalars();

/// Return true if all of the instructions in the block can be speculatively		/// Return true if all of the instructions in the block can be speculatively
/// executed. \p SafePtrs is a list of addresses that are known to be legal		/// executed. \p SafePtrs is a list of addresses that are known to be legal
/// and we know that we can read from them without segfault.		/// and we know that we can read from them without segfault.
bool blockCanBePredicated(BasicBlock *BB, SmallPtrSetImpl<Value *> &SafePtrs);		bool blockCanBePredicated(BasicBlock *BB, SmallPtrSetImpl<Value *> &SafePtrs);

/// Updates the vectorization state by adding \p Phi to the inductions list.		/// Updates the vectorization state by adding \p Phi to the inductions list.
/// This can set \p Phi as the main induction of the loop if \p Phi is a		/// This can set \p Phi as the main induction of the loop if \p Phi is a
/// better choice for the main induction than the existing one.		/// better choice for the main induction than the existing one.
[73 lines not shown]	private:
RecurrenceSet FirstOrderRecurrences;		RecurrenceSet FirstOrderRecurrences;
/// Holds the widest induction type encountered.		/// Holds the widest induction type encountered.
Type *WidestIndTy;		Type *WidestIndTy;

/// Allowed outside users. This holds the induction and reduction		/// Allowed outside users. This holds the induction and reduction
/// vars which can be accessed from outside the loop.		/// vars which can be accessed from outside the loop.
SmallPtrSet<Value *, 4> AllowedExit;		SmallPtrSet<Value *, 4> AllowedExit;

/// Holds the instructions known to be uniform after vectorization.
SmallPtrSet<Instruction *, 4> Uniforms;

/// Holds the instructions known to be scalar after vectorization.
SmallPtrSet<Instruction *, 4> Scalars;

/// Can we assume the absence of NaNs.		/// Can we assume the absence of NaNs.
bool HasFunNoNaNAttr;		bool HasFunNoNaNAttr;

/// Vectorization requirements that will go through late-evaluation.		/// Vectorization requirements that will go through late-evaluation.
LoopVectorizationRequirements *Requirements;		LoopVectorizationRequirements *Requirements;

/// Used to emit an analysis of any legality issues.		/// Used to emit an analysis of any legality issues.
LoopVectorizeHints *Hints;		LoopVectorizeHints *Hints;
[40 lines not shown]	public:

/// \return The desired interleave count.		/// \return The desired interleave count.
/// If interleave count has been specified by metadata it will be returned.		/// If interleave count has been specified by metadata it will be returned.
/// Otherwise, the interleave count is computed and returned. VF and LoopCost		/// Otherwise, the interleave count is computed and returned. VF and LoopCost
/// are the selected vectorization factor and the cost of the selected VF.		/// are the selected vectorization factor and the cost of the selected VF.
unsigned selectInterleaveCount(bool OptForSize, unsigned VF,		unsigned selectInterleaveCount(bool OptForSize, unsigned VF,
unsigned LoopCost);		unsigned LoopCost);

		/// Memory access instruction may be vectorized in more than one way.
		/// Form of instruction after vectorization depends on cost.
		/// This function takes cost-based decisions for Load/Store instructions
		/// and collects them in a map. This decisions map is used for building
		/// the lists of loop-uniform and loop-scalar instructions.
		/// The calculated cost is saved with widening decision in order to
		/// avoid redundant calculations.
		void setCostBasedWideningDecision(unsigned VF);

/// \brief A struct that represents some properties of the register usage		/// \brief A struct that represents some properties of the register usage
/// of a loop.		/// of a loop.
struct RegisterUsage {		struct RegisterUsage {
/// Holds the number of loop invariant values that are used in the loop.		/// Holds the number of loop invariant values that are used in the loop.
unsigned LoopInvariantRegs;		unsigned LoopInvariantRegs;
/// Holds the maximum number of concurrent live intervals in the loop.		/// Holds the maximum number of concurrent live intervals in the loop.
unsigned MaxLocalUsers;		unsigned MaxLocalUsers;
/// Holds the number of instructions in the loop.		/// Holds the number of instructions in the loop.
[18 lines not shown]	public:
/// vectorization factor \p VF.		/// vectorization factor \p VF.
bool isProfitableToScalarize(Instruction *I, unsigned VF) const {		bool isProfitableToScalarize(Instruction *I, unsigned VF) const {
auto Scalars = InstsToScalarize.find(VF);		auto Scalars = InstsToScalarize.find(VF);
assert(Scalars != InstsToScalarize.end() &&		assert(Scalars != InstsToScalarize.end() &&
"VF not yet analyzed for scalarization profitability");		"VF not yet analyzed for scalarization profitability");
return Scalars->second.count(I);		return Scalars->second.count(I);
}		}

		/// Returns true if \p I is known to be uniform after vectorization.
		bool isUniformAfterVectorization(Instruction *I, unsigned VF) const {
		if (VF == 1)
		return true;
		assert(Uniforms.count(VF) && "VF not yet analyzed for uniformity");
		auto UniformsPerVF = Uniforms.find(VF);
		return UniformsPerVF->second.count(I);
		}

		/// Returns true if \p I is known to be scalar after vectorization.
		bool isScalarAfterVectorization(Instruction *I, unsigned VF) const {
		if (VF == 1)
		return true;
		assert(Scalars.count(VF) && "Scalar values are not calculated for VF");
		auto ScalarsPerVF = Scalars.find(VF);
		return ScalarsPerVF->second.count(I);
		}

/// \returns True if instruction \p I can be truncated to a smaller bitwidth		/// \returns True if instruction \p I can be truncated to a smaller bitwidth
/// for vectorization factor \p VF.		/// for vectorization factor \p VF.
bool canTruncateToMinimalBitwidth(Instruction *I, unsigned VF) const {		bool canTruncateToMinimalBitwidth(Instruction *I, unsigned VF) const {
return VF > 1 && MinBWs.count(I) && !isProfitableToScalarize(I, VF) &&		return VF > 1 && MinBWs.count(I) && !isProfitableToScalarize(I, VF) &&
!Legal->isScalarAfterVectorization(I);		!isScalarAfterVectorization(I, VF);
		}

		/// Decision that was taken during cost calculation for memory instruction.
		enum InstWidening {
		CM_Unknown,
		CM_Widen,
		CM_Interleave,
		CM_GatherScatter,
		CM_Scalarize
		};

		/// Save vectorization decision \p W and \p Cost taken by the cost model for
		/// instruction \p I and vector width \p VF.
		void setWideningDecision(Instruction *I, unsigned VF, InstWidening W,
		unsigned Cost) {
		assert(VF >= 2 && "Expected VF >=2");
		WideningDecisions[std::make_pair(I, VF)] = std::make_pair(W, Cost);
		}

		/// Save vectorization decision \p W and \p Cost taken by the cost model for
		/// interleaving group \p Grp and vector width \p VF.
		void setWideningDecision(const InterleaveGroup *Grp, unsigned VF,
		InstWidening W, unsigned Cost) {
		assert(VF >= 2 && "Expected VF >=2");
		/// Broadcast this decision to all instructions inside the group.
		/// But the cost will be assigned to one instruction only.
		for (unsigned i = 0; i < Grp->getFactor(); ++i) {
		if (auto *I = Grp->getMember(i)) {
		if (Grp->getInsertPos() == I)
		WideningDecisions[std::make_pair(I, VF)] = std::make_pair(W, Cost);
		else
		WideningDecisions[std::make_pair(I, VF)] = std::make_pair(W, 0);
		}
		}
		}

		/// Return the cost model decision for the given instruction \p I and vector
		/// width \p VF. Return CM_Unknown if this instruction did not pass
		/// through the cost modeling.
		InstWidening getWideningDecision(Instruction *I, unsigned VF) {
		assert(VF >= 2 && "Expected VF >=2");
		std::pair<Instruction *, unsigned> InstOnVF = std::make_pair(I, VF);
		auto Itr = WideningDecisions.find(InstOnVF);
		if (Itr == WideningDecisions.end())
		return CM_Unknown;
		return Itr->second.first;
		}

		/// Return the vectorization cost for the given instruction \p I and vector
		/// width \p VF.
		unsigned getWideningCost(Instruction *I, unsigned VF) {
		assert(VF >= 2 && "Expected VF >=2");
		std::pair<Instruction *, unsigned> InstOnVF = std::make_pair(I, VF);
		assert(WideningDecisions.count(InstOnVF) && "The cost is not calculated");
		return WideningDecisions[InstOnVF].second;
}		}
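
The bookkeeping added above can be pictured with a small standalone model. This is only a sketch under stated assumptions: `InstId`, `Group`, and `DecisionTable` are hypothetical stand-ins (plain integers instead of `Instruction *`, a vector instead of `InterleaveGroup`), not LLVM types. It shows the two properties the patch relies on: a group decision is broadcast to every member while the cost is charged to the insert position only, and a lookup for an unanalyzed (instruction, VF) pair yields `CM_Unknown`.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the LLVM types used in the patch.
enum InstWidening {
  CM_Unknown,
  CM_Widen,
  CM_Interleave,
  CM_GatherScatter,
  CM_Scalarize
};
using InstId = int; // plays the role of Instruction *

struct Group {
  std::vector<InstId> Members; // members of the interleave group
  InstId InsertPos;            // the member that carries the group cost
};

class DecisionTable {
  // (instruction, VF) -> (decision, cost), mirroring WideningDecisions.
  std::map<std::pair<InstId, unsigned>, std::pair<InstWidening, unsigned>>
      Decisions;

public:
  // Record a decision for a single instruction.
  void set(InstId I, unsigned VF, InstWidening W, unsigned Cost) {
    assert(VF >= 2 && "Expected VF >= 2");
    Decisions[std::make_pair(I, VF)] = std::make_pair(W, Cost);
  }

  // Broadcast the decision to all members; only InsertPos keeps the cost,
  // so summing per-instruction costs counts the group exactly once.
  void set(const Group &G, unsigned VF, InstWidening W, unsigned Cost) {
    for (InstId I : G.Members)
      set(I, VF, W, I == G.InsertPos ? Cost : 0);
  }

  // CM_Unknown means the pair never went through the cost model.
  InstWidening decision(InstId I, unsigned VF) const {
    auto It = Decisions.find(std::make_pair(I, VF));
    return It == Decisions.end() ? CM_Unknown : It->second.first;
  }

  unsigned cost(InstId I, unsigned VF) const {
    auto It = Decisions.find(std::make_pair(I, VF));
    assert(It != Decisions.end() && "The cost is not calculated");
    return It->second.second;
  }
};
```

With a three-member group whose insert position is member 2, recording `CM_Interleave` at cost 10 for VF 4 leaves members 1 and 3 with cost 0 and member 2 with cost 10, while a query at VF 8 still reports `CM_Unknown`.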

private:		private:
/// The vectorization cost is a combination of the cost itself and a boolean		/// The vectorization cost is a combination of the cost itself and a boolean
/// indicating whether any of the contributing operations will actually		/// indicating whether any of the contributing operations will actually
/// operate on		/// operate on
/// vector values after type legalization in the backend. If this latter value		/// vector values after type legalization in the backend. If this latter value
/// is		/// is
[10 lines not shown]	private:
/// Returns the execution time cost of an instruction for a given vector		/// Returns the execution time cost of an instruction for a given vector
/// width. Vector width of one means scalar.		/// width. Vector width of one means scalar.
VectorizationCostTy getInstructionCost(Instruction *I, unsigned VF);		VectorizationCostTy getInstructionCost(Instruction *I, unsigned VF);

/// The cost-computation logic from getInstructionCost which provides		/// The cost-computation logic from getInstructionCost which provides
/// the vector type as an output parameter.		/// the vector type as an output parameter.
unsigned getInstructionCost(Instruction *I, unsigned VF, Type *&VectorTy);		unsigned getInstructionCost(Instruction *I, unsigned VF, Type *&VectorTy);

		/// Calculate vectorization cost of memory instruction \p I.
		unsigned getMemoryInstructionCost(Instruction *I, unsigned VF);

		/// The cost computation for scalarized memory instruction.
		unsigned getMemInstScalarizationCost(Instruction *I, unsigned VF);

		/// The cost computation for interleaving group of memory instructions.
		unsigned getInterleaveGroupCost(Instruction *I, unsigned VF);

		/// The cost computation for Gather/Scatter instruction.
		unsigned getGatherScatterCost(Instruction *I, unsigned VF);

		/// The cost computation for widening instruction \p I with consecutive
		/// memory access.
		unsigned getConsecutiveMemOpCost(Instruction *I, unsigned VF);

		/// The cost calculation for Load instruction \p I with uniform pointer -
		/// scalar load + broadcast.
		unsigned getUniformMemOpCost(Instruction *I, unsigned VF);

/// Returns whether the instruction is a load or store and will be emitted		/// Returns whether the instruction is a load or store and will be emitted
/// as a vector operation.		/// as a vector operation.
bool isConsecutiveLoadOrStore(Instruction *I);		bool isConsecutiveLoadOrStore(Instruction *I);

/// Create an analysis remark that explains why vectorization failed		/// Create an analysis remark that explains why vectorization failed
///		///
/// \p RemarkName is the identifier for the remark. \return the remark object		/// \p RemarkName is the identifier for the remark. \return the remark object
/// that can be streamed to.		/// that can be streamed to.
[13 lines not shown]	private:
typedef DenseMap<Instruction *, unsigned> ScalarCostsTy;		typedef DenseMap<Instruction *, unsigned> ScalarCostsTy;

/// A map holding scalar costs for different vectorization factors. The		/// A map holding scalar costs for different vectorization factors. The
/// presence of a cost for an instruction in the mapping indicates that the		/// presence of a cost for an instruction in the mapping indicates that the
/// instruction will be scalarized when vectorizing with the associated		/// instruction will be scalarized when vectorizing with the associated
/// vectorization factor. The entries are VF-ScalarCostTy pairs.		/// vectorization factor. The entries are VF-ScalarCostTy pairs.
DenseMap<unsigned, ScalarCostsTy> InstsToScalarize;		DenseMap<unsigned, ScalarCostsTy> InstsToScalarize;

		/// Holds the instructions known to be uniform after vectorization.
		/// The data is collected per VF.
		DenseMap<unsigned, SmallPtrSet<Instruction *, 4>> Uniforms;

		/// Holds the instructions known to be scalar after vectorization.
		/// The data is collected per VF.
		DenseMap<unsigned, SmallPtrSet<Instruction *, 4>> Scalars;

/// Returns the expected difference in cost from scalarizing the expression		/// Returns the expected difference in cost from scalarizing the expression
/// feeding a predicated instruction \p PredInst. The instructions to		/// feeding a predicated instruction \p PredInst. The instructions to
/// scalarize and their scalar costs are collected in \p ScalarCosts. A		/// scalarize and their scalar costs are collected in \p ScalarCosts. A
/// non-negative return value implies the expression will be scalarized.		/// non-negative return value implies the expression will be scalarized.
/// Currently, only single-use chains are considered for scalarization.		/// Currently, only single-use chains are considered for scalarization.
int computePredInstDiscount(Instruction *PredInst, ScalarCostsTy &ScalarCosts,		int computePredInstDiscount(Instruction *PredInst, ScalarCostsTy &ScalarCosts,
unsigned VF);		unsigned VF);

/// Collects the instructions to scalarize for each predicated instruction in		/// Collects the instructions to scalarize for each predicated instruction in
/// the loop.		/// the loop.
void collectInstsToScalarize(unsigned VF);		void collectInstsToScalarize(unsigned VF);

		/// Collect the instructions that are uniform after vectorization. An
		/// instruction is uniform if we represent it with a single scalar value in
		/// the vectorized loop corresponding to each vector iteration. Examples of
		/// uniform instructions include pointer operands of consecutive or
		/// interleaved memory accesses. Note that although uniformity implies an
		/// instruction will be scalar, the reverse is not true. In general, a
		/// scalarized instruction will be represented by VF scalar values in the
		/// vectorized loop, each corresponding to an iteration of the original
		/// scalar loop.
		void collectLoopUniforms(unsigned VF);

		/// Collect the instructions that are scalar after vectorization. An
		/// instruction is scalar if it is known to be uniform or will be scalarized
		/// during vectorization. Non-uniform scalarized instructions will be
		/// represented by VF values in the vectorized loop, each corresponding to an
		/// iteration of the original scalar loop.
		void collectLoopScalars(unsigned VF);

		/// Collect Uniform and Scalar values for the given \p VF.
		/// The sets depend on CM decision for Load/Store instructions
		/// that may be vectorized as interleave, gather-scatter or scalarized.
		void collectUniformsAndScalars(unsigned VF) {
		// Do the analysis once.
		if (VF == 1 || Uniforms.count(VF))
		return;
		setCostBasedWideningDecision(VF);
		collectLoopUniforms(VF);
		collectLoopScalars(VF);
		}
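
The "do the analysis once" guard above can be modeled in isolation. The following is a minimal sketch, assuming instructions are plain integers and using an invented placeholder rule (instruction 0 is uniform) in place of the real `collectLoopUniforms(VF)`; the class name `PerVFSets` and the `runs()` counter are hypothetical and exist only to make the caching observable.

```cpp
#include <cassert>
#include <map>
#include <set>

// Hypothetical model: instructions are ints, not llvm::Instruction *.
class PerVFSets {
  std::map<unsigned, std::set<int>> Uniforms; // filled once per VF
  unsigned AnalysisRuns = 0;                  // counts real analyses

  // Placeholder for collectLoopUniforms(VF); the rule is invented.
  void collectUniforms(unsigned VF) { Uniforms[VF].insert(0); }

public:
  // Mirrors the guard: skip VF == 1 and any VF already analyzed.
  void collectUniformsAndScalars(unsigned VF) {
    if (VF == 1 || Uniforms.count(VF))
      return;
    ++AnalysisRuns;
    collectUniforms(VF);
  }

  // For VF == 1 everything is trivially uniform; otherwise the VF must
  // have been analyzed before the query.
  bool isUniformAfterVectorization(int I, unsigned VF) const {
    if (VF == 1)
      return true;
    auto It = Uniforms.find(VF);
    assert(It != Uniforms.end() && "VF not yet analyzed for uniformity");
    return It->second.count(I) != 0;
  }

  unsigned runs() const { return AnalysisRuns; }
};
```

Calling `collectUniformsAndScalars(4)` twice runs the analysis once, and a call with VF 1 never runs it, which matches the early return in the patch.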

		/// Keeps cost model vectorization decision and cost for instructions.
		/// Right now it is used for memory instructions only.
		typedef DenseMap<std::pair<Instruction *, unsigned>,
		std::pair<InstWidening, unsigned>>
		DecisionList;

		DecisionList WideningDecisions;

public:		public:
/// The loop that we evaluate.		/// The loop that we evaluate.
Loop *TheLoop;		Loop *TheLoop;
/// Predicated scalar evolution analysis.		/// Predicated scalar evolution analysis.
PredicatedScalarEvolution &PSE;		PredicatedScalarEvolution &PSE;
/// Loop Info analysis.		/// Loop Info analysis.
LoopInfo *LI;		LoopInfo *LI;
/// Vectorization legality.		/// Vectorization legality.
[217 lines not shown]	void InnerLoopVectorizer::createVectorIntInductionPHI(
LastInduction->moveBefore(ICmp);		LastInduction->moveBefore(ICmp);
LastInduction->setName("vec.ind.next");		LastInduction->setName("vec.ind.next");

VecInd->addIncoming(SteppedStart, LoopVectorPreHeader);		VecInd->addIncoming(SteppedStart, LoopVectorPreHeader);
VecInd->addIncoming(LastInduction, LoopVectorLatch);		VecInd->addIncoming(LastInduction, LoopVectorLatch);
}		}

bool InnerLoopVectorizer::shouldScalarizeInstruction(Instruction *I) const {		bool InnerLoopVectorizer::shouldScalarizeInstruction(Instruction *I) const {
return Legal->isScalarAfterVectorization(I) ||		return Cost->isScalarAfterVectorization(I, VF) ||
Cost->isProfitableToScalarize(I, VF);		Cost->isProfitableToScalarize(I, VF);
}		}

bool InnerLoopVectorizer::needsScalarInduction(Instruction *IV) const {		bool InnerLoopVectorizer::needsScalarInduction(Instruction *IV) const {
if (shouldScalarizeInstruction(IV))		if (shouldScalarizeInstruction(IV))
return true;		return true;
auto isScalarInst = [&](User *U) -> bool {		auto isScalarInst = [&](User *U) -> bool {
auto *I = cast<Instruction>(U);		auto *I = cast<Instruction>(U);
[159 lines not shown]	void InnerLoopVectorizer::buildScalarSteps(Value *ScalarIV, Value *Step,
Type *ScalarIVTy = ScalarIV->getType()->getScalarType();		Type *ScalarIVTy = ScalarIV->getType()->getScalarType();
assert(ScalarIVTy->isIntegerTy() && ScalarIVTy == Step->getType() &&		assert(ScalarIVTy->isIntegerTy() && ScalarIVTy == Step->getType() &&
"Val and Step should have the same integer type");		"Val and Step should have the same integer type");

// Determine the number of scalars we need to generate for each unroll		// Determine the number of scalars we need to generate for each unroll
// iteration. If EntryVal is uniform, we only need to generate the first		// iteration. If EntryVal is uniform, we only need to generate the first
// lane. Otherwise, we generate all VF values.		// lane. Otherwise, we generate all VF values.
unsigned Lanes =		unsigned Lanes =
Legal->isUniformAfterVectorization(cast<Instruction>(EntryVal)) ? 1 : VF;		Cost->isUniformAfterVectorization(cast<Instruction>(EntryVal), VF) ? 1 : VF;

// Compute the scalar steps and save the results in VectorLoopValueMap.		// Compute the scalar steps and save the results in VectorLoopValueMap.
ScalarParts Entry(UF);		ScalarParts Entry(UF);
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
Entry[Part].resize(VF);		Entry[Part].resize(VF);
for (unsigned Lane = 0; Lane < Lanes; ++Lane) {		for (unsigned Lane = 0; Lane < Lanes; ++Lane) {
auto *StartIdx = ConstantInt::get(ScalarIVTy, VF * Part + Lane);		auto *StartIdx = ConstantInt::get(ScalarIVTy, VF * Part + Lane);
auto *Mul = Builder.CreateMul(StartIdx, Step);		auto *Mul = Builder.CreateMul(StartIdx, Step);
[51 lines not shown]	if (VF == 1) {
Entry[Part] = getScalarValue(V, Part, 0);		Entry[Part] = getScalarValue(V, Part, 0);
return VectorLoopValueMap.initVector(V, Entry);		return VectorLoopValueMap.initVector(V, Entry);
}		}

// Get the last scalar instruction we generated for V. If the value is		// Get the last scalar instruction we generated for V. If the value is
// known to be uniform after vectorization, this corresponds to lane zero		// known to be uniform after vectorization, this corresponds to lane zero
// of the last unroll iteration. Otherwise, the last instruction is the one		// of the last unroll iteration. Otherwise, the last instruction is the one
// we created for the last vector lane of the last unroll iteration.		// we created for the last vector lane of the last unroll iteration.
unsigned LastLane = Legal->isUniformAfterVectorization(I) ? 0 : VF - 1;		unsigned LastLane = Cost->isUniformAfterVectorization(I, VF) ? 0 : VF - 1;
auto *LastInst = cast<Instruction>(getScalarValue(V, UF - 1, LastLane));		auto *LastInst = cast<Instruction>(getScalarValue(V, UF - 1, LastLane));

// Set the insert point after the last scalarized instruction. This ensures		// Set the insert point after the last scalarized instruction. This ensures
// the insertelement sequence will directly follow the scalar definitions.		// the insertelement sequence will directly follow the scalar definitions.
auto OldIP = Builder.saveIP();		auto OldIP = Builder.saveIP();
auto NewIP = std::next(BasicBlock::iterator(LastInst));		auto NewIP = std::next(BasicBlock::iterator(LastInst));
Builder.SetInsertPoint(&*NewIP);		Builder.SetInsertPoint(&*NewIP);

// However, if we are vectorizing, we need to construct the vector values.		// However, if we are vectorizing, we need to construct the vector values.
// If the value is known to be uniform after vectorization, we can just		// If the value is known to be uniform after vectorization, we can just
// broadcast the scalar value corresponding to lane zero for each unroll		// broadcast the scalar value corresponding to lane zero for each unroll
// iteration. Otherwise, we construct the vector values using insertelement		// iteration. Otherwise, we construct the vector values using insertelement
// instructions. Since the resulting vectors are stored in		// instructions. Since the resulting vectors are stored in
// VectorLoopValueMap, we will only generate the insertelements once.		// VectorLoopValueMap, we will only generate the insertelements once.
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
Value *VectorValue = nullptr;		Value *VectorValue = nullptr;
if (Legal->isUniformAfterVectorization(I)) {		if (Cost->isUniformAfterVectorization(I, VF)) {
VectorValue = getBroadcastInstrs(getScalarValue(V, Part, 0));		VectorValue = getBroadcastInstrs(getScalarValue(V, Part, 0));
} else {		} else {
VectorValue = UndefValue::get(VectorType::get(V->getType(), VF));		VectorValue = UndefValue::get(VectorType::get(V->getType(), VF));
for (unsigned Lane = 0; Lane < VF; ++Lane)		for (unsigned Lane = 0; Lane < VF; ++Lane)
VectorValue = Builder.CreateInsertElement(		VectorValue = Builder.CreateInsertElement(
VectorValue, getScalarValue(V, Part, Lane),		VectorValue, getScalarValue(V, Part, Lane),
Builder.getInt32(Lane));		Builder.getInt32(Lane));
}		}
[12 lines not shown]
Value *InnerLoopVectorizer::getScalarValue(Value *V, unsigned Part,		Value *InnerLoopVectorizer::getScalarValue(Value *V, unsigned Part,
unsigned Lane) {		unsigned Lane) {

// If the value is not an instruction contained in the loop, it should		// If the value is not an instruction contained in the loop, it should
// already be scalar.		// already be scalar.
if (OrigLoop->isLoopInvariant(V))		if (OrigLoop->isLoopInvariant(V))
return V;		return V;

assert(Lane > 0 ? !Legal->isUniformAfterVectorization(cast<Instruction>(V))		assert(Lane > 0 ?
		!Cost->isUniformAfterVectorization(cast<Instruction>(V), VF)
: true && "Uniform values only have lane zero");		: true && "Uniform values only have lane zero");

// If the value from the original loop has not been vectorized, it is		// If the value from the original loop has not been vectorized, it is
// represented by UF x VF scalar values in the new loop. Return the requested		// represented by UF x VF scalar values in the new loop. Return the requested
// scalar value.		// scalar value.
if (VectorLoopValueMap.hasScalar(V))		if (VectorLoopValueMap.hasScalar(V))
return VectorLoopValueMap.ScalarMapStorage[V][Part][Lane];		return VectorLoopValueMap.ScalarMapStorage[V][Part][Lane];

// If the value has not been scalarized, get its entry in VectorLoopValueMap		// If the value has not been scalarized, get its entry in VectorLoopValueMap
[54 lines not shown]
void InnerLoopVectorizer::vectorizeInterleaveGroup(Instruction *Instr) {		void InnerLoopVectorizer::vectorizeInterleaveGroup(Instruction *Instr) {
const InterleaveGroup *Group = Legal->getInterleavedAccessGroup(Instr);		const InterleaveGroup *Group = Legal->getInterleavedAccessGroup(Instr);
assert(Group && "Fail to get an interleaved access group.");		assert(Group && "Fail to get an interleaved access group.");

// Skip if current instruction is not the insert position.		// Skip if current instruction is not the insert position.
if (Instr != Group->getInsertPos())		if (Instr != Group->getInsertPos())
return;		return;

LoadInst *LI = dyn_cast<LoadInst>(Instr);
StoreInst *SI = dyn_cast<StoreInst>(Instr);
Value *Ptr = getPointerOperand(Instr);		Value *Ptr = getPointerOperand(Instr);

// Prepare for the vector type of the interleaved load/store.		// Prepare for the vector type of the interleaved load/store.
Type *ScalarTy = LI ? LI->getType() : SI->getValueOperand()->getType();		Type *ScalarTy = getMemInstValueType(Instr);
unsigned InterleaveFactor = Group->getFactor();		unsigned InterleaveFactor = Group->getFactor();
Type *VecTy = VectorType::get(ScalarTy, InterleaveFactor * VF);		Type *VecTy = VectorType::get(ScalarTy, InterleaveFactor * VF);
Type *PtrTy = VecTy->getPointerTo(Ptr->getType()->getPointerAddressSpace());		Type *PtrTy = VecTy->getPointerTo(getMemInstAddressSpace(Instr));

// Prepare for the new pointers.		// Prepare for the new pointers.
setDebugLocFromInst(Builder, Ptr);		setDebugLocFromInst(Builder, Ptr);
SmallVector<Value *, 2> NewPtrs;		SmallVector<Value *, 2> NewPtrs;
unsigned Index = Group->getIndex(Instr);		unsigned Index = Group->getIndex(Instr);

// If the group is reverse, adjust the index to refer to the last vector lane		// If the group is reverse, adjust the index to refer to the last vector lane
// instead of the first. We adjust the index from the first vector lane,		// instead of the first. We adjust the index from the first vector lane,
[23 lines not shown]	for (unsigned Part = 0; Part < UF; Part++) {
// Cast to the vector pointer type.		// Cast to the vector pointer type.
NewPtrs.push_back(Builder.CreateBitCast(NewPtr, PtrTy));		NewPtrs.push_back(Builder.CreateBitCast(NewPtr, PtrTy));
}		}

setDebugLocFromInst(Builder, Instr);		setDebugLocFromInst(Builder, Instr);
Value *UndefVec = UndefValue::get(VecTy);		Value *UndefVec = UndefValue::get(VecTy);

// Vectorize the interleaved load group.		// Vectorize the interleaved load group.
if (LI) {		if (isa<LoadInst>(Instr)) {

// For each unroll part, create a wide load for the group.		// For each unroll part, create a wide load for the group.
SmallVector<Value *, 2> NewLoads;		SmallVector<Value *, 2> NewLoads;
for (unsigned Part = 0; Part < UF; Part++) {		for (unsigned Part = 0; Part < UF; Part++) {
auto *NewLoad = Builder.CreateAlignedLoad(		auto *NewLoad = Builder.CreateAlignedLoad(
NewPtrs[Part], Group->getAlignment(), "wide.vec");		NewPtrs[Part], Group->getAlignment(), "wide.vec");
addMetadata(NewLoad, Instr);		addMetadata(NewLoad, Instr);
NewLoads.push_back(NewLoad);		NewLoads.push_back(NewLoad);
[... 68 lines not shown ...]

void InnerLoopVectorizer::vectorizeMemoryInstruction(Instruction *Instr) {		void InnerLoopVectorizer::vectorizeMemoryInstruction(Instruction *Instr) {
// Attempt to issue a wide load.		// Attempt to issue a wide load.
LoadInst *LI = dyn_cast<LoadInst>(Instr);		LoadInst *LI = dyn_cast<LoadInst>(Instr);
StoreInst *SI = dyn_cast<StoreInst>(Instr);		StoreInst *SI = dyn_cast<StoreInst>(Instr);

assert((LI || SI) && "Invalid Load/Store instruction");		assert((LI || SI) && "Invalid Load/Store instruction");

// Try to vectorize the interleave group if this access is interleaved.		LoopVectorizationCostModel::InstWidening Decision =
if (Legal->isAccessInterleaved(Instr))		Cost->getWideningDecision(Instr, VF);
		assert(Decision != LoopVectorizationCostModel::CM_Unknown &&
		"CM decision should be taken at this point");
		if (Decision == LoopVectorizationCostModel::CM_Interleave)
return vectorizeInterleaveGroup(Instr);		return vectorizeInterleaveGroup(Instr);

Type *ScalarDataTy = LI ? LI->getType() : SI->getValueOperand()->getType();		Type *ScalarDataTy = getMemInstValueType(Instr);
Type *DataTy = VectorType::get(ScalarDataTy, VF);		Type *DataTy = VectorType::get(ScalarDataTy, VF);
Value *Ptr = getPointerOperand(Instr);		Value *Ptr = getPointerOperand(Instr);
unsigned Alignment = LI ? LI->getAlignment() : SI->getAlignment();		unsigned Alignment = getMemInstAlignment(Instr);
// An alignment of 0 means target abi alignment. We need to use the scalar's		// An alignment of 0 means target abi alignment. We need to use the scalar's
// target abi alignment in such a case.		// target abi alignment in such a case.
const DataLayout &DL = Instr->getModule()->getDataLayout();		const DataLayout &DL = Instr->getModule()->getDataLayout();
if (!Alignment)		if (!Alignment)
Alignment = DL.getABITypeAlignment(ScalarDataTy);		Alignment = DL.getABITypeAlignment(ScalarDataTy);
unsigned AddressSpace = Ptr->getType()->getPointerAddressSpace();		unsigned AddressSpace = getMemInstAddressSpace(Instr);

// Scalarize the memory instruction if necessary.		// Scalarize the memory instruction if necessary.
if (Legal->memoryInstructionMustBeScalarized(Instr, VF))		if (Decision == LoopVectorizationCostModel::CM_Scalarize)
return scalarizeInstruction(Instr, Legal->isScalarWithPredication(Instr));		return scalarizeInstruction(Instr, Legal->isScalarWithPredication(Instr));

// Determine if the pointer operand of the access is either consecutive or		// Determine if the pointer operand of the access is either consecutive or
// reverse consecutive.		// reverse consecutive.
int ConsecutiveStride = Legal->isConsecutivePtr(Ptr);		int ConsecutiveStride = Legal->isConsecutivePtr(Ptr);
bool Reverse = ConsecutiveStride < 0;		bool Reverse = ConsecutiveStride < 0;

// Determine if either a gather or scatter operation is legal.
bool CreateGatherScatter =		bool CreateGatherScatter =
!ConsecutiveStride && Legal->isLegalGatherOrScatter(Instr);		(Decision == LoopVectorizationCostModel::CM_GatherScatter);

VectorParts VectorGep;		VectorParts VectorGep;

// Handle consecutive loads/stores.		// Handle consecutive loads/stores.
GetElementPtrInst *Gep = getGEPInstruction(Ptr);		GetElementPtrInst *Gep = getGEPInstruction(Ptr);
if (ConsecutiveStride) {		if (ConsecutiveStride) {
if (Gep) {		if (Gep) {
unsigned NumOperands = Gep->getNumOperands();		unsigned NumOperands = Gep->getNumOperands();
[... 168 lines not shown ...]		void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr,

VectorParts Cond;		VectorParts Cond;
if (IfPredicateInstr)		if (IfPredicateInstr)
Cond = createBlockInMask(Instr->getParent());		Cond = createBlockInMask(Instr->getParent());

// Determine the number of scalars we need to generate for each unroll		// Determine the number of scalars we need to generate for each unroll
// iteration. If the instruction is uniform, we only need to generate the		// iteration. If the instruction is uniform, we only need to generate the
// first lane. Otherwise, we generate all VF values.		// first lane. Otherwise, we generate all VF values.
unsigned Lanes = Legal->isUniformAfterVectorization(Instr) ? 1 : VF;		unsigned Lanes = Cost->isUniformAfterVectorization(Instr, VF) ? 1 : VF;

// For each vector unroll 'part':		// For each vector unroll 'part':
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
Entry[Part].resize(VF);		Entry[Part].resize(VF);
// For each scalar that we create:		// For each scalar that we create:
for (unsigned Lane = 0; Lane < Lanes; ++Lane) {		for (unsigned Lane = 0; Lane < Lanes; ++Lane) {

// Start if-block.		// Start if-block.
[... 1,565 lines not shown ...]		case InductionDescriptor::IK_PtrInduction: {
// Handle the pointer induction variable case.		// Handle the pointer induction variable case.
assert(P->getType()->isPointerTy() && "Unexpected type.");		assert(P->getType()->isPointerTy() && "Unexpected type.");
// This is the normalized GEP that starts counting at zero.		// This is the normalized GEP that starts counting at zero.
Value *PtrInd = Induction;		Value *PtrInd = Induction;
PtrInd = Builder.CreateSExtOrTrunc(PtrInd, II.getStep()->getType());		PtrInd = Builder.CreateSExtOrTrunc(PtrInd, II.getStep()->getType());
// Determine the number of scalars we need to generate for each unroll		// Determine the number of scalars we need to generate for each unroll
// iteration. If the instruction is uniform, we only need to generate the		// iteration. If the instruction is uniform, we only need to generate the
// first lane. Otherwise, we generate all VF values.		// first lane. Otherwise, we generate all VF values.
unsigned Lanes = Legal->isUniformAfterVectorization(P) ? 1 : VF;		unsigned Lanes = Cost->isUniformAfterVectorization(P, VF) ? 1 : VF;
// These are the scalar results. Notice that we don't generate vector GEPs		// These are the scalar results. Notice that we don't generate vector GEPs
// because scalar GEPs result in better code.		// because scalar GEPs result in better code.
ScalarParts Entry(UF);		ScalarParts Entry(UF);
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
Entry[Part].resize(VF);		Entry[Part].resize(VF);
for (unsigned Lane = 0; Lane < Lanes; ++Lane) {		for (unsigned Lane = 0; Lane < Lanes; ++Lane) {
Constant *Idx = ConstantInt::get(PtrInd->getType(), Lane + Part * VF);		Constant *Idx = ConstantInt::get(PtrInd->getType(), Lane + Part * VF);
Value *GlobalIdx = Builder.CreateAdd(PtrInd, Idx);		Value *GlobalIdx = Builder.CreateAdd(PtrInd, Idx);
[... 487 lines not shown ...]		bool LoopVectorizationLegality::canVectorize() {
// If an override option has been passed in for interleaved accesses, use it.		// If an override option has been passed in for interleaved accesses, use it.
if (EnableInterleavedMemAccesses.getNumOccurrences() > 0)		if (EnableInterleavedMemAccesses.getNumOccurrences() > 0)
UseInterleaved = EnableInterleavedMemAccesses;		UseInterleaved = EnableInterleavedMemAccesses;

// Analyze interleaved memory accesses.		// Analyze interleaved memory accesses.
if (UseInterleaved)		if (UseInterleaved)
InterleaveInfo.analyzeInterleaving(*getSymbolicStrides());		InterleaveInfo.analyzeInterleaving(*getSymbolicStrides());

// Collect all instructions that are known to be uniform after vectorization.
collectLoopUniforms();

// Collect all instructions that are known to be scalar after vectorization.
collectLoopScalars();

unsigned SCEVThreshold = VectorizeSCEVCheckThreshold;		unsigned SCEVThreshold = VectorizeSCEVCheckThreshold;
if (Hints->getForce() == LoopVectorizeHints::FK_Enabled)		if (Hints->getForce() == LoopVectorizeHints::FK_Enabled)
SCEVThreshold = PragmaVectorizeSCEVCheckThreshold;		SCEVThreshold = PragmaVectorizeSCEVCheckThreshold;

if (PSE.getUnionPredicate().getComplexity() > SCEVThreshold) {		if (PSE.getUnionPredicate().getComplexity() > SCEVThreshold) {
ORE->emit(createMissedAnalysis("TooManySCEVRunTimeChecks")		ORE->emit(createMissedAnalysis("TooManySCEVRunTimeChecks")
<< "Too many SCEV assumptions need to be made and checked "		<< "Too many SCEV assumptions need to be made and checked "
<< "at runtime");		<< "at runtime");
[... 249 lines not shown ...]		bool LoopVectorizationLegality::canVectorizeInstrs() {
// is the same size. If it's not, unset it here and InnerLoopVectorizer		// is the same size. If it's not, unset it here and InnerLoopVectorizer
// will create another.		// will create another.
if (Induction && WidestIndTy != Induction->getType())		if (Induction && WidestIndTy != Induction->getType())
Induction = nullptr;		Induction = nullptr;

return true;		return true;
}		}

void LoopVectorizationLegality::collectLoopScalars() {		void LoopVectorizationCostModel::collectLoopScalars(unsigned VF) {

		// We should not collect Scalars more than once per VF. Right now,
		// this function is called from collectUniformsAndScalars(), which
		// already does this check. Collecting Scalars for VF=1 does not make any
		// sense.

		assert(VF >= 2 && !Scalars.count(VF) &&
		"This function should not be visited twice for the same VF");

// If an instruction is uniform after vectorization, it will remain scalar.		// If an instruction is uniform after vectorization, it will remain scalar.
Scalars.insert(Uniforms.begin(), Uniforms.end());		Scalars[VF].insert(Uniforms[VF].begin(), Uniforms[VF].end());

// Collect the getelementptr instructions that will not be vectorized. A		// Collect the getelementptr instructions that will not be vectorized. A
// getelementptr instruction is only vectorized if it is used for a legal		// getelementptr instruction is only vectorized if it is used for a legal
// gather or scatter operation.		// gather or scatter operation.
for (auto *BB : TheLoop->blocks())		for (auto *BB : TheLoop->blocks())
for (auto &I : *BB) {		for (auto &I : *BB) {
if (auto *GEP = dyn_cast<GetElementPtrInst>(&I)) {		if (auto *GEP = dyn_cast<GetElementPtrInst>(&I)) {
Scalars.insert(GEP);		Scalars[VF].insert(GEP);
continue;		continue;
}		}
auto *Ptr = getPointerOperand(&I);		auto *Ptr = getPointerOperand(&I);
if (!Ptr)		if (!Ptr)
continue;		continue;
auto *GEP = getGEPInstruction(Ptr);		auto *GEP = getGEPInstruction(Ptr);
if (GEP && isLegalGatherOrScatter(&I))		if (GEP && getWideningDecision(&I, VF) == CM_GatherScatter)
Scalars.erase(GEP);		Scalars[VF].erase(GEP);
}		}

// An induction variable will remain scalar if all users of the induction		// An induction variable will remain scalar if all users of the induction
// variable and induction variable update remain scalar.		// variable and induction variable update remain scalar.
auto *Latch = TheLoop->getLoopLatch();		auto *Latch = TheLoop->getLoopLatch();
for (auto &Induction : *getInductionVars()) {		for (auto &Induction : *Legal->getInductionVars()) {
auto *Ind = Induction.first;		auto *Ind = Induction.first;
auto *IndUpdate = cast<Instruction>(Ind->getIncomingValueForBlock(Latch));		auto *IndUpdate = cast<Instruction>(Ind->getIncomingValueForBlock(Latch));

// Determine if all users of the induction variable are scalar after		// Determine if all users of the induction variable are scalar after
// vectorization.		// vectorization.
auto ScalarInd = all_of(Ind->users(), [&](User *U) -> bool {		auto ScalarInd = all_of(Ind->users(), [&](User *U) -> bool {
auto *I = cast<Instruction>(U);		auto *I = cast<Instruction>(U);
return I == IndUpdate || !TheLoop->contains(I) || Scalars.count(I);		return I == IndUpdate || !TheLoop->contains(I) || Scalars[VF].count(I);
});		});
if (!ScalarInd)		if (!ScalarInd)
continue;		continue;

// Determine if all users of the induction variable update instruction are		// Determine if all users of the induction variable update instruction are
// scalar after vectorization.		// scalar after vectorization.
auto ScalarIndUpdate = all_of(IndUpdate->users(), [&](User *U) -> bool {		auto ScalarIndUpdate = all_of(IndUpdate->users(), [&](User *U) -> bool {
auto *I = cast<Instruction>(U);		auto *I = cast<Instruction>(U);
return I == Ind || !TheLoop->contains(I) || Scalars.count(I);		return I == Ind || !TheLoop->contains(I) || Scalars[VF].count(I);
});		});
if (!ScalarIndUpdate)		if (!ScalarIndUpdate)
continue;		continue;

// The induction variable and its update instruction will remain scalar.		// The induction variable and its update instruction will remain scalar.
Scalars.insert(Ind);		Scalars[VF].insert(Ind);
Scalars.insert(IndUpdate);		Scalars[VF].insert(IndUpdate);
}
}		}

bool LoopVectorizationLegality::hasConsecutiveLikePtrOperand(Instruction *I) {
if (isAccessInterleaved(I))
return true;
if (auto *Ptr = getPointerOperand(I))
return isConsecutivePtr(Ptr);
return false;
}		}

bool LoopVectorizationLegality::isScalarWithPredication(Instruction *I) {		bool LoopVectorizationLegality::isScalarWithPredication(Instruction *I) {
if (!blockNeedsPredication(I->getParent()))		if (!blockNeedsPredication(I->getParent()))
return false;		return false;
switch(I->getOpcode()) {		switch(I->getOpcode()) {
default:		default:
break;		break;
case Instruction::Store:		case Instruction::Store:
return !isMaskRequired(I);		return !isMaskRequired(I);
case Instruction::UDiv:		case Instruction::UDiv:
case Instruction::SDiv:		case Instruction::SDiv:
case Instruction::SRem:		case Instruction::SRem:
case Instruction::URem:		case Instruction::URem:
return mayDivideByZero(*I);		return mayDivideByZero(*I);
}		}
return false;		return false;
}		}

bool LoopVectorizationLegality::memoryInstructionMustBeScalarized(		bool LoopVectorizationLegality::memoryInstructionCanBeWidened(Instruction *I,
Instruction *I, unsigned VF) {		unsigned VF) {

// If the memory instruction is in an interleaved group, it will be
// vectorized and its pointer will remain uniform.
if (isAccessInterleaved(I))
return false;

// Get and ensure we have a valid memory instruction.		// Get and ensure we have a valid memory instruction.
LoadInst *LI = dyn_cast<LoadInst>(I);		LoadInst *LI = dyn_cast<LoadInst>(I);
StoreInst *SI = dyn_cast<StoreInst>(I);		StoreInst *SI = dyn_cast<StoreInst>(I);
assert((LI || SI) && "Invalid memory instruction");		assert((LI || SI) && "Invalid memory instruction");

// If the pointer operand is uniform (loop invariant), the memory instruction
// will be scalarized.
auto *Ptr = getPointerOperand(I);		auto *Ptr = getPointerOperand(I);
if (LI && isUniform(Ptr))
return true;

// If the pointer operand is non-consecutive and neither a gather nor a		// In order to be widened, the pointer should be consecutive, first of all.
// scatter operation is legal, the memory instruction will be scalarized.		if (!isConsecutivePtr(Ptr))
if (!isConsecutivePtr(Ptr) && !isLegalGatherOrScatter(I))		return false;
return true;

// If the instruction is a store located in a predicated block, it will be		// If the instruction is a store located in a predicated block, it will be
// scalarized.		// scalarized.
if (isScalarWithPredication(I))		if (isScalarWithPredication(I))
return true;		return false;

// If the instruction's allocated size doesn't equal it's type size, it		// If the instruction's allocated size doesn't equal it's type size, it
// requires padding and will be scalarized.		// requires padding and will be scalarized.
auto &DL = I->getModule()->getDataLayout();		auto &DL = I->getModule()->getDataLayout();
auto *ScalarTy = LI ? LI->getType() : SI->getValueOperand()->getType();		auto *ScalarTy = LI ? LI->getType() : SI->getValueOperand()->getType();
if (hasIrregularType(ScalarTy, DL, VF))		if (hasIrregularType(ScalarTy, DL, VF))
return true;

// Otherwise, the memory instruction should be vectorized if the rest of the
// loop is.
return false;		return false;

		return true;
}		}

void LoopVectorizationLegality::collectLoopUniforms() {		void LoopVectorizationCostModel::collectLoopUniforms(unsigned VF) {

		// We should not collect Uniforms more than once per VF. Right now,
		// this function is called from collectUniformsAndScalars(), which
		// already does this check. Collecting Uniforms for VF=1 does not make any
		// sense.

		assert(VF >= 2 && !Uniforms.count(VF) &&
		"This function should not be visited twice for the same VF");

		// Visit the list of Uniforms. If we'll not find any uniform value, we'll
		// not analyze again. Uniforms.count(VF) will return 1.
		Uniforms[VF].clear();

// We now know that the loop is vectorizable!		// We now know that the loop is vectorizable!
// Collect instructions inside the loop that will remain uniform after		// Collect instructions inside the loop that will remain uniform after
// vectorization.		// vectorization.

// Global values, params and instructions outside of current loop are out of		// Global values, params and instructions outside of current loop are out of
// scope.		// scope.
auto isOutOfScope = [&](Value *V) -> bool {		auto isOutOfScope = [&](Value *V) -> bool {
Instruction *I = dyn_cast<Instruction>(V);		Instruction *I = dyn_cast<Instruction>(V);
[... 16 lines not shown ...]		void LoopVectorizationCostModel::collectLoopUniforms(unsigned VF) {
// are pointers that are treated like consecutive pointers during		// are pointers that are treated like consecutive pointers during
// vectorization. The pointer operands of interleaved accesses are an		// vectorization. The pointer operands of interleaved accesses are an
// example.		// example.
SmallSetVector<Instruction *, 8> ConsecutiveLikePtrs;		SmallSetVector<Instruction *, 8> ConsecutiveLikePtrs;

// Holds pointer operands of instructions that are possibly non-uniform.		// Holds pointer operands of instructions that are possibly non-uniform.
SmallPtrSet<Instruction *, 8> PossibleNonUniformPtrs;		SmallPtrSet<Instruction *, 8> PossibleNonUniformPtrs;

		auto isUniformDecision = [&](Instruction *I, unsigned VF) {
		InstWidening WideningDecision = getWideningDecision(I, VF);
		assert(WideningDecision != CM_Unknown &&
		"Widening decision should be ready at this moment");

		return (WideningDecision == CM_Widen ||
		WideningDecision == CM_Interleave);
		};
// Iterate over the instructions in the loop, and collect all		// Iterate over the instructions in the loop, and collect all
// consecutive-like pointer operands in ConsecutiveLikePtrs. If it's possible		// consecutive-like pointer operands in ConsecutiveLikePtrs. If it's possible
// that a consecutive-like pointer operand will be scalarized, we collect it		// that a consecutive-like pointer operand will be scalarized, we collect it
// in PossibleNonUniformPtrs instead. We use two sets here because a single		// in PossibleNonUniformPtrs instead. We use two sets here because a single
// getelementptr instruction can be used by both vectorized and scalarized		// getelementptr instruction can be used by both vectorized and scalarized
// memory instructions. For example, if a loop loads and stores from the same		// memory instructions. For example, if a loop loads and stores from the same
// location, but the store is conditional, the store will be scalarized, and		// location, but the store is conditional, the store will be scalarized, and
// the getelementptr won't remain uniform.		// the getelementptr won't remain uniform.
for (auto *BB : TheLoop->blocks())		for (auto *BB : TheLoop->blocks())
for (auto &I : *BB) {		for (auto &I : *BB) {

// If there's no pointer operand, there's nothing to do.		// If there's no pointer operand, there's nothing to do.
auto *Ptr = dyn_cast_or_null<Instruction>(getPointerOperand(&I));		auto *Ptr = dyn_cast_or_null<Instruction>(getPointerOperand(&I));
if (!Ptr)		if (!Ptr)
continue;		continue;

// True if all users of Ptr are memory accesses that have Ptr as their		// True if all users of Ptr are memory accesses that have Ptr as their
// pointer operand.		// pointer operand.
auto UsersAreMemAccesses = all_of(Ptr->users(), [&](User *U) -> bool {		auto UsersAreMemAccesses = all_of(Ptr->users(), [&](User *U) -> bool {
return getPointerOperand(U) == Ptr;		return getPointerOperand(U) == Ptr;
});		});

// Ensure the memory instruction will not be scalarized, making its		// Ensure the memory instruction will not be scalarized or used by
// pointer operand non-uniform. If the pointer operand is used by some		// gather/scatter, making its pointer operand non-uniform. If the pointer
// instruction other than a memory access, we're not going to check if		// operand is used by any instruction other than a memory access, we
// that other instruction may be scalarized here. Thus, conservatively		// conservatively assume the pointer operand may be non-uniform.
// assume the pointer operand may be non-uniform.		if (!UsersAreMemAccesses || !isUniformDecision(&I, VF))
if (!UsersAreMemAccesses || memoryInstructionMustBeScalarized(&I))
PossibleNonUniformPtrs.insert(Ptr);		PossibleNonUniformPtrs.insert(Ptr);

// If the memory instruction will be vectorized and its pointer operand		// If the memory instruction will be vectorized and its pointer operand
// is consecutive-like, the pointer operand should remain uniform.		// is consecutive-like, or interleaving - the pointer operand should
else if (hasConsecutiveLikePtrOperand(&I))		// remain uniform.
ConsecutiveLikePtrs.insert(Ptr);

// Otherwise, if the memory instruction will be vectorized and its
// pointer operand is non-consecutive-like, the memory instruction should
// be a gather or scatter operation. Its pointer operand will be
// non-uniform.
else		else
PossibleNonUniformPtrs.insert(Ptr);		ConsecutiveLikePtrs.insert(Ptr);
}		}

// Add to the Worklist all consecutive and consecutive-like pointers that		// Add to the Worklist all consecutive and consecutive-like pointers that
// aren't also identified as possibly non-uniform.		// aren't also identified as possibly non-uniform.
for (auto *V : ConsecutiveLikePtrs)		for (auto *V : ConsecutiveLikePtrs)
if (!PossibleNonUniformPtrs.count(V)) {		if (!PossibleNonUniformPtrs.count(V)) {
DEBUG(dbgs() << "LV: Found uniform instruction: " << *V << "\n");		DEBUG(dbgs() << "LV: Found uniform instruction: " << *V << "\n");
Worklist.insert(V);		Worklist.insert(V);
[... 18 lines not shown ...]		for (auto OV : I->operand_values()) {
DEBUG(dbgs() << "LV: Found uniform instruction: " << *OI << "\n");		DEBUG(dbgs() << "LV: Found uniform instruction: " << *OI << "\n");
}		}
}		}
}		}

// Returns true if Ptr is the pointer operand of a memory access instruction		// Returns true if Ptr is the pointer operand of a memory access instruction
// I, and I is known to not require scalarization.		// I, and I is known to not require scalarization.
auto isVectorizedMemAccessUse = [&](Instruction *I, Value *Ptr) -> bool {		auto isVectorizedMemAccessUse = [&](Instruction *I, Value *Ptr) -> bool {
return getPointerOperand(I) == Ptr && !memoryInstructionMustBeScalarized(I);		return getPointerOperand(I) == Ptr && isUniformDecision(I, VF);
};		};

// For an instruction to be added into Worklist above, all its users inside		// For an instruction to be added into Worklist above, all its users inside
// the loop should also be in Worklist. However, this condition cannot be		// the loop should also be in Worklist. However, this condition cannot be
// true for phi nodes that form a cyclic dependence. We must process phi		// true for phi nodes that form a cyclic dependence. We must process phi
// nodes separately. An induction variable will remain uniform if all users		// nodes separately. An induction variable will remain uniform if all users
// of the induction variable and induction variable update remain uniform.		// of the induction variable and induction variable update remain uniform.
// The code below handles both pointer and non-pointer induction variables.		// The code below handles both pointer and non-pointer induction variables.
for (auto &Induction : Inductions) {		for (auto &Induction : *Legal->getInductionVars()) {
auto *Ind = Induction.first;		auto *Ind = Induction.first;
auto *IndUpdate = cast<Instruction>(Ind->getIncomingValueForBlock(Latch));		auto *IndUpdate = cast<Instruction>(Ind->getIncomingValueForBlock(Latch));

// Determine if all users of the induction variable are uniform after		// Determine if all users of the induction variable are uniform after
// vectorization.		// vectorization.
auto UniformInd = all_of(Ind->users(), [&](User *U) -> bool {		auto UniformInd = all_of(Ind->users(), [&](User *U) -> bool {
auto *I = cast<Instruction>(U);		auto *I = cast<Instruction>(U);
return I == IndUpdate || !TheLoop->contains(I) || Worklist.count(I) ||		return I == IndUpdate || !TheLoop->contains(I) || Worklist.count(I) ||
[... 14 lines not shown ...]		for (auto &Induction : *Legal->getInductionVars()) {

// The induction variable and its update instruction will remain uniform.		// The induction variable and its update instruction will remain uniform.
Worklist.insert(Ind);		Worklist.insert(Ind);
Worklist.insert(IndUpdate);		Worklist.insert(IndUpdate);
DEBUG(dbgs() << "LV: Found uniform instruction: " << *Ind << "\n");		DEBUG(dbgs() << "LV: Found uniform instruction: " << *Ind << "\n");
DEBUG(dbgs() << "LV: Found uniform instruction: " << *IndUpdate << "\n");		DEBUG(dbgs() << "LV: Found uniform instruction: " << *IndUpdate << "\n");
}		}

Uniforms.insert(Worklist.begin(), Worklist.end());		Uniforms[VF].insert(Worklist.begin(), Worklist.end());
}		}

bool LoopVectorizationLegality::canVectorizeMemory() {		bool LoopVectorizationLegality::canVectorizeMemory() {
LAI = &(*GetLAA)(*TheLoop);		LAI = &(*GetLAA)(*TheLoop);
InterleaveInfo.setLAI(LAI);		InterleaveInfo.setLAI(LAI);
const OptimizationRemarkAnalysis *LAR = LAI->getReport();		const OptimizationRemarkAnalysis *LAR = LAI->getReport();
if (LAR) {		if (LAR) {
OptimizationRemarkAnalysis VR(Hints->vectorizeAnalysisPassName(),		OptimizationRemarkAnalysis VR(Hints->vectorizeAnalysisPassName(),
[... 123 lines not shown ...]		for (auto &I : *BB) {
int64_t Stride = getPtrStride(PSE, Ptr, TheLoop, Strides,		int64_t Stride = getPtrStride(PSE, Ptr, TheLoop, Strides,
/*Assume=*/true, /*ShouldCheckWrap=*/false);		/*Assume=*/true, /*ShouldCheckWrap=*/false);

const SCEV *Scev = replaceSymbolicStrideSCEV(PSE, Strides, Ptr);		const SCEV *Scev = replaceSymbolicStrideSCEV(PSE, Strides, Ptr);
PointerType *PtrTy = dyn_cast<PointerType>(Ptr->getType());		PointerType *PtrTy = dyn_cast<PointerType>(Ptr->getType());
uint64_t Size = DL.getTypeAllocSize(PtrTy->getElementType());		uint64_t Size = DL.getTypeAllocSize(PtrTy->getElementType());

// An alignment of 0 means target ABI alignment.		// An alignment of 0 means target ABI alignment.
unsigned Align = LI ? LI->getAlignment() : SI->getAlignment();		unsigned Align = getMemInstAlignment(&I);
if (!Align)		if (!Align)
Align = DL.getABITypeAlignment(PtrTy->getElementType());		Align = DL.getABITypeAlignment(PtrTy->getElementType());

AccessStrideInfo[&I] = StrideDescriptor(Stride, Scev, Size, Align);		AccessStrideInfo[&I] = StrideDescriptor(Stride, Scev, Size, Align);
}		}
}		}

// Analyze interleaved accesses and collect them into interleaved load and		// Analyze interleaved accesses and collect them into interleaved load and
[... 225 lines not shown ...]		for (InterleaveGroup *Group : LoadGroups) {
if (LastMember) {		if (LastMember) {
Value *LastMemberPtr = getPointerOperand(LastMember);		Value *LastMemberPtr = getPointerOperand(LastMember);
if (!getPtrStride(PSE, LastMemberPtr, TheLoop, Strides, /*Assume=*/false,		if (!getPtrStride(PSE, LastMemberPtr, TheLoop, Strides, /*Assume=*/false,
/*ShouldCheckWrap=*/true)) {		/*ShouldCheckWrap=*/true)) {
DEBUG(dbgs() << "LV: Invalidate candidate interleaved group due to "		DEBUG(dbgs() << "LV: Invalidate candidate interleaved group due to "
"last group member potentially pointer-wrapping.\n");		"last group member potentially pointer-wrapping.\n");
releaseGroup(Group);		releaseGroup(Group);
}		}
}		} else {
else {
// Case 3: A non-reversed interleaved load group with gaps: We need		// Case 3: A non-reversed interleaved load group with gaps: We need
// to execute at least one scalar epilogue iteration. This will ensure		// to execute at least one scalar epilogue iteration. This will ensure
// we don't speculatively access memory out-of-bounds. We only need		// we don't speculatively access memory out-of-bounds. We only need
// to look for a member at index factor - 1, since every group must have		// to look for a member at index factor - 1, since every group must have
// a member at index zero.		// a member at index zero.
if (Group->isReverse()) {		if (Group->isReverse()) {
releaseGroup(Group);		releaseGroup(Group);
continue;		continue;
[... 113 lines not shown ...]		LoopVectorizationCostModel::selectVectorizationFactor(bool OptForSize) {
}		}

int UserVF = Hints->getWidth();		int UserVF = Hints->getWidth();
if (UserVF != 0) {		if (UserVF != 0) {
assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");		assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");
DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");		DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");

Factor.Width = UserVF;		Factor.Width = UserVF;

		collectUniformsAndScalars(UserVF);
collectInstsToScalarize(UserVF);		collectInstsToScalarize(UserVF);
return Factor;		return Factor;
}		}

float Cost = expectedCost(1).first;		float Cost = expectedCost(1).first;
#ifndef NDEBUG		#ifndef NDEBUG
const float ScalarCost = Cost;		const float ScalarCost = Cost;
#endif /* NDEBUG */		#endif /* NDEBUG */
[... 350 lines not shown ...]		if (ValuesToIgnore.count(I))
continue;		continue;

// For each VF find the maximum usage of registers.		// For each VF find the maximum usage of registers.
for (unsigned j = 0, e = VFs.size(); j < e; ++j) {		for (unsigned j = 0, e = VFs.size(); j < e; ++j) {
if (VFs[j] == 1) {		if (VFs[j] == 1) {
MaxUsages[j] = std::max(MaxUsages[j], OpenIntervals.size());		MaxUsages[j] = std::max(MaxUsages[j], OpenIntervals.size());
continue;		continue;
}		}
		collectUniformsAndScalars(VFs[j]);
// Count the number of live intervals.		// Count the number of live intervals.
unsigned RegUsage = 0;		unsigned RegUsage = 0;
for (auto Inst : OpenIntervals) {		for (auto Inst : OpenIntervals) {
// Skip ignored values for VF > 1.		// Skip ignored values for VF > 1.
if (VecValuesToIgnore.count(Inst))		if (VecValuesToIgnore.count(Inst) ||
		isScalarAfterVectorization(Inst, VFs[j]))
continue;		continue;
RegUsage += GetRegUsage(Inst->getType(), VFs[j]);		RegUsage += GetRegUsage(Inst->getType(), VFs[j]);
}		}
MaxUsages[j] = std::max(MaxUsages[j], RegUsage);		MaxUsages[j] = std::max(MaxUsages[j], RegUsage);
}		}

DEBUG(dbgs() << "LV(REG): At #" << i << " Interval # "		DEBUG(dbgs() << "LV(REG): At #" << i << " Interval # "
<< OpenIntervals.size() << '\n');		<< OpenIntervals.size() << '\n');
[... 52 lines not shown ...]		for (Instruction &I : *BB)
}		}
}		}
}		}

int LoopVectorizationCostModel::computePredInstDiscount(		int LoopVectorizationCostModel::computePredInstDiscount(
Instruction *PredInst, DenseMap<Instruction *, unsigned> &ScalarCosts,		Instruction *PredInst, DenseMap<Instruction *, unsigned> &ScalarCosts,
unsigned VF) {		unsigned VF) {

assert(!Legal->isUniformAfterVectorization(PredInst) &&		assert(!isUniformAfterVectorization(PredInst, VF) &&
"Instruction marked uniform-after-vectorization will be predicated");		"Instruction marked uniform-after-vectorization will be predicated");

// Initialize the discount to zero, meaning that the scalar version and the		// Initialize the discount to zero, meaning that the scalar version and the
// vector version cost the same.		// vector version cost the same.
int Discount = 0;		int Discount = 0;

// Holds instructions to analyze. The instructions we visit are mapped in		// Holds instructions to analyze. The instructions we visit are mapped in
// ScalarCosts. Those instructions are the ones that would be scalarized if		// ScalarCosts. Those instructions are the ones that would be scalarized if
// we find that the scalar version costs less.		// we find that the scalar version costs less.
SmallVector<Instruction *, 8> Worklist;		SmallVector<Instruction *, 8> Worklist;

// Returns true if the given instruction can be scalarized.		// Returns true if the given instruction can be scalarized.
auto canBeScalarized = [&](Instruction *I) -> bool {		auto canBeScalarized = [&](Instruction *I) -> bool {

// We only attempt to scalarize instructions forming a single-use chain		// We only attempt to scalarize instructions forming a single-use chain
// from the original predicated block that would otherwise be vectorized.		// from the original predicated block that would otherwise be vectorized.
// Although not strictly necessary, we give up on instructions we know will		// Although not strictly necessary, we give up on instructions we know will
// already be scalar to avoid traversing chains that are unlikely to be		// already be scalar to avoid traversing chains that are unlikely to be
// beneficial.		// beneficial.
if (!I->hasOneUse() || PredInst->getParent() != I->getParent() ||		if (!I->hasOneUse() || PredInst->getParent() != I->getParent() ||
Legal->isScalarAfterVectorization(I))		isScalarAfterVectorization(I, VF))
return false;		return false;

// If the instruction is scalar with predication, it will be analyzed		// If the instruction is scalar with predication, it will be analyzed
// separately. We ignore it within the context of PredInst.		// separately. We ignore it within the context of PredInst.
if (Legal->isScalarWithPredication(I))		if (Legal->isScalarWithPredication(I))
return false;		return false;

// If any of the instruction's operands are uniform after vectorization,		// If any of the instruction's operands are uniform after vectorization,
// the instruction cannot be scalarized. This prevents, for example, a		// the instruction cannot be scalarized. This prevents, for example, a
// masked load from being scalarized.		// masked load from being scalarized.
//		//
// We assume we will only emit a value for lane zero of an instruction		// We assume we will only emit a value for lane zero of an instruction
// marked uniform after vectorization, rather than VF identical values.		// marked uniform after vectorization, rather than VF identical values.
// Thus, if we scalarize an instruction that uses a uniform, we would		// Thus, if we scalarize an instruction that uses a uniform, we would
// create uses of values corresponding to the lanes we aren't emitting code		// create uses of values corresponding to the lanes we aren't emitting code
// for. This behavior can be changed by allowing getScalarValue to clone		// for. This behavior can be changed by allowing getScalarValue to clone
// the lane zero values for uniforms rather than asserting.		// the lane zero values for uniforms rather than asserting.
for (Use &U : I->operands())		for (Use &U : I->operands())
if (auto *J = dyn_cast<Instruction>(U.get()))		if (auto *J = dyn_cast<Instruction>(U.get()))
if (Legal->isUniformAfterVectorization(J))		if (isUniformAfterVectorization(J, VF))
return false;		return false;

// Otherwise, we can scalarize the instruction.		// Otherwise, we can scalarize the instruction.
return true;		return true;
};		};

// Returns true if an operand that cannot be scalarized must be extracted		// Returns true if an operand that cannot be scalarized must be extracted
// from a vector. We will account for this scalarization overhead below. Note		// from a vector. We will account for this scalarization overhead below. Note
// that the non-void predicated instructions are placed in their own blocks,		// that the non-void predicated instructions are placed in their own blocks,
// and their return values are inserted into vectors. Thus, an extract would		// and their return values are inserted into vectors. Thus, an extract would
// still be required.		// still be required.
auto needsExtract = [&](Instruction *I) -> bool {		auto needsExtract = [&](Instruction *I) -> bool {
return TheLoop->contains(I) && !Legal->isScalarAfterVectorization(I);		return TheLoop->contains(I) && !isScalarAfterVectorization(I, VF);
};		};

// Compute the expected cost discount from scalarizing the entire expression		// Compute the expected cost discount from scalarizing the entire expression
// feeding the predicated instruction. We currently only consider expressions		// feeding the predicated instruction. We currently only consider expressions
// that are single-use instruction chains.		// that are single-use instruction chains.
Worklist.push_back(PredInst);		Worklist.push_back(PredInst);
while (!Worklist.empty()) {		while (!Worklist.empty()) {
Instruction *I = Worklist.pop_back_val();		Instruction *I = Worklist.pop_back_val();
... 46 lines not shown ...	int LoopVectorizationCostModel::computePredInstDiscount(

return Discount;		return Discount;
}		}

LoopVectorizationCostModel::VectorizationCostTy		LoopVectorizationCostModel::VectorizationCostTy
LoopVectorizationCostModel::expectedCost(unsigned VF) {		LoopVectorizationCostModel::expectedCost(unsigned VF) {
VectorizationCostTy Cost;		VectorizationCostTy Cost;

		// Collect Uniform and Scalar instructions after vectorization with VF.
		collectUniformsAndScalars(VF);

// Collect the instructions (and their associated costs) that will be more		// Collect the instructions (and their associated costs) that will be more
// profitable to scalarize.		// profitable to scalarize.
collectInstsToScalarize(VF);		collectInstsToScalarize(VF);

// For each block.		// For each block.
for (BasicBlock *BB : TheLoop->blocks()) {		for (BasicBlock *BB : TheLoop->blocks()) {
VectorizationCostTy BlockCost;		VectorizationCostTy BlockCost;

... 63 lines not shown ...	static const SCEV *getAddressAccessSCEV(
return SE->getSCEV(Ptr);		return SE->getSCEV(Ptr);
}		}

static bool isStrideMul(Instruction *I, LoopVectorizationLegality *Legal) {		static bool isStrideMul(Instruction *I, LoopVectorizationLegality *Legal) {
return Legal->hasStride(I->getOperand(0)) ||		return Legal->hasStride(I->getOperand(0)) ||
Legal->hasStride(I->getOperand(1));		Legal->hasStride(I->getOperand(1));
}		}

		unsigned LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
		unsigned VF) {
		Type *ValTy = getMemInstValueType(I);
		auto SE = PSE.getSE();

		unsigned Alignment = getMemInstAlignment(I);
		unsigned AS = getMemInstAddressSpace(I);
		Value *Ptr = getPointerOperand(I);
		Type *PtrTy = ToVectorTy(Ptr->getType(), VF);

		// Figure out whether the access is strided and get the stride value
		// if it's known in compile time
		const SCEV *PtrSCEV = getAddressAccessSCEV(Ptr, Legal, SE, TheLoop);

		// Get the cost of the scalar memory instruction and address computation.
		unsigned Cost = VF * TTI.getAddressComputationCost(PtrTy, SE, PtrSCEV);

		Cost += VF *
		TTI.getMemoryOpCost(I->getOpcode(), ValTy->getScalarType(), Alignment,
		AS);

		// Get the overhead of the extractelement and insertelement instructions
		// we might create due to scalarization.
		Cost += getScalarizationOverhead(I, VF, TTI);

		// If we have a predicated store, it may not be executed for each vector
		// lane. Scale the cost by the probability of executing the predicated
		// block.
		if (Legal->isScalarWithPredication(I))
		Cost /= getReciprocalPredBlockProb();

		return Cost;
		}

		unsigned LoopVectorizationCostModel::getConsecutiveMemOpCost(Instruction *I,
		unsigned VF) {
		Type *ValTy = getMemInstValueType(I);
		Type *VectorTy = ToVectorTy(ValTy, VF);
		unsigned Alignment = getMemInstAlignment(I);
		Value *Ptr = getPointerOperand(I);
		unsigned AS = getMemInstAddressSpace(I);
		int ConsecutiveStride = Legal->isConsecutivePtr(Ptr);

		assert((ConsecutiveStride == 1 || ConsecutiveStride == -1) &&
		"Stride should be 1 or -1 for consecutive memory access");
		unsigned Cost = 0;
		if (Legal->isMaskRequired(I))
		Cost += TTI.getMaskedMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS);
		else
		Cost += TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS);

		bool Reverse = ConsecutiveStride < 0;
		if (Reverse)
		Cost += TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0);
		return Cost;
		}

		unsigned LoopVectorizationCostModel::getUniformMemOpCost(Instruction *I,
		unsigned VF) {
		LoadInst *LI = cast<LoadInst>(I);
		Type *ValTy = LI->getType();
		Type *VectorTy = ToVectorTy(ValTy, VF);
		unsigned Alignment = LI->getAlignment();
		unsigned AS = LI->getPointerAddressSpace();

		return TTI.getAddressComputationCost(ValTy) +
		TTI.getMemoryOpCost(Instruction::Load, ValTy, Alignment, AS) +
		TTI.getShuffleCost(TargetTransformInfo::SK_Broadcast, VectorTy);
		}

		unsigned LoopVectorizationCostModel::getGatherScatterCost(Instruction *I,
		unsigned VF) {
		Type *ValTy = getMemInstValueType(I);
		Type *VectorTy = ToVectorTy(ValTy, VF);
		unsigned Alignment = getMemInstAlignment(I);
		Value *Ptr = getPointerOperand(I);

		return TTI.getAddressComputationCost(VectorTy) +
		TTI.getGatherScatterOpCost(I->getOpcode(), VectorTy, Ptr,
		Legal->isMaskRequired(I), Alignment);
		}

		unsigned LoopVectorizationCostModel::getInterleaveGroupCost(Instruction *I,
		unsigned VF) {
		Type *ValTy = getMemInstValueType(I);
		Type *VectorTy = ToVectorTy(ValTy, VF);
		unsigned AS = getMemInstAddressSpace(I);

		auto Group = Legal->getInterleavedAccessGroup(I);
		assert(Group && "Fail to get an interleaved access group.");

		unsigned InterleaveFactor = Group->getFactor();
		Type *WideVecTy = VectorType::get(ValTy, VF * InterleaveFactor);

		// Holds the indices of existing members in an interleaved load group.
		// An interleaved store group doesn't need this as it doesn't allow gaps.
		SmallVector<unsigned, 4> Indices;
		if (isa<LoadInst>(I)) {
		for (unsigned i = 0; i < InterleaveFactor; i++)
		if (Group->getMember(i))
		Indices.push_back(i);
		}

		// Calculate the cost of the whole interleaved group.
		unsigned Cost = TTI.getInterleavedMemoryOpCost(I->getOpcode(), WideVecTy,
		Group->getFactor(), Indices,
		Group->getAlignment(), AS);

		if (Group->isReverse())
		Cost += Group->getNumMembers() *
		TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0);
		return Cost;
		}

		unsigned LoopVectorizationCostModel::getMemoryInstructionCost(Instruction *I,
		unsigned VF) {

		// Calculate scalar cost only. Vectorization cost should be ready at this
		// moment.
		if (VF == 1) {
		Type *ValTy = getMemInstValueType(I);
		unsigned Alignment = getMemInstAlignment(I);
		unsigned AS = getMemInstAddressSpace(I);

		return TTI.getAddressComputationCost(ValTy) +
		TTI.getMemoryOpCost(I->getOpcode(), ValTy, Alignment, AS);
		}
		return getWideningCost(I, VF);
		}

LoopVectorizationCostModel::VectorizationCostTy		LoopVectorizationCostModel::VectorizationCostTy
LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF) {		LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF) {
// If we know that this instruction will remain uniform, check the cost of		// If we know that this instruction will remain uniform, check the cost of
// the scalar version.		// the scalar version.
if (Legal->isUniformAfterVectorization(I))		if (isUniformAfterVectorization(I, VF))
VF = 1;		VF = 1;

if (VF > 1 && isProfitableToScalarize(I, VF))		if (VF > 1 && isProfitableToScalarize(I, VF))
return VectorizationCostTy(InstsToScalarize[VF][I], false);		return VectorizationCostTy(InstsToScalarize[VF][I], false);

Type *VectorTy;		Type *VectorTy;
unsigned C = getInstructionCost(I, VF, VectorTy);		unsigned C = getInstructionCost(I, VF, VectorTy);

bool TypeNotScalarized =		bool TypeNotScalarized =
VF > 1 && !VectorTy->isVoidTy() && TTI.getNumberOfParts(VectorTy) < VF;		VF > 1 && !VectorTy->isVoidTy() && TTI.getNumberOfParts(VectorTy) < VF;
return VectorizationCostTy(C, TypeNotScalarized);		return VectorizationCostTy(C, TypeNotScalarized);
}		}

		void LoopVectorizationCostModel::setCostBasedWideningDecision(unsigned VF) {
		if (VF == 1)
		return;
		for (BasicBlock *BB : TheLoop->blocks()) {
		// For each instruction in the old loop.
		for (Instruction &I : *BB) {
		Value *Ptr = getPointerOperand(&I);
		if (!Ptr)
		continue;

		if (isa<LoadInst>(&I) && Legal->isUniform(Ptr)) {
		// Scalar load + broadcast
		unsigned Cost = getUniformMemOpCost(&I, VF);
		setWideningDecision(&I, VF, CM_Scalarize, Cost);
		continue;
		}

		// We assume that widening is the best solution when possible.
		if (Legal->memoryInstructionCanBeWidened(&I, VF)) {
		unsigned Cost = getConsecutiveMemOpCost(&I, VF);
		setWideningDecision(&I, VF, CM_Widen, Cost);
		continue;
		}

		// Choose between Interleaving, Gather/Scatter or Scalarization.
		unsigned InterleaveCost = UINT_MAX;
		unsigned NumAccesses = 1;
		if (Legal->isAccessInterleaved(&I)) {
		auto Group = Legal->getInterleavedAccessGroup(&I);
		assert(Group && "Fail to get an interleaved access group.");

		// Make one decision for the whole group.
		if (getWideningDecision(&I, VF) != CM_Unknown)
		continue;

		NumAccesses = Group->getNumMembers();
		InterleaveCost = getInterleaveGroupCost(&I, VF);
		}

		unsigned GatherScatterCost =
		Legal->isLegalGatherOrScatter(&I)
		? getGatherScatterCost(&I, VF) * NumAccesses
		: UINT_MAX;

		unsigned ScalarizationCost =
		getMemInstScalarizationCost(&I, VF) * NumAccesses;

		// Choose better solution for the current VF,
		// write down this decision and use it during vectorization.
		unsigned Cost;
		InstWidening Decision;
		if (InterleaveCost <= GatherScatterCost &&
		InterleaveCost < ScalarizationCost) {
		Decision = CM_Interleave;
		Cost = InterleaveCost;
		} else if (GatherScatterCost < ScalarizationCost) {
		Decision = CM_GatherScatter;
		Cost = GatherScatterCost;
		} else {
		Decision = CM_Scalarize;
		Cost = ScalarizationCost;
		}
		// If the instruction belongs to an interleave group, the whole group
		// receives the same decision. The whole group receives the cost, but
		// the cost will actually be assigned to one instruction.
		if (auto Group = Legal->getInterleavedAccessGroup(&I))
		setWideningDecision(Group, VF, Decision, Cost);
		else
		setWideningDecision(&I, VF, Decision, Cost);
		}
		}
		}

unsigned LoopVectorizationCostModel::getInstructionCost(Instruction *I,		unsigned LoopVectorizationCostModel::getInstructionCost(Instruction *I,
unsigned VF,		unsigned VF,
Type *&VectorTy) {		Type *&VectorTy) {
Type *RetTy = I->getType();		Type *RetTy = I->getType();
if (canTruncateToMinimalBitwidth(I, VF))		if (canTruncateToMinimalBitwidth(I, VF))
RetTy = IntegerType::get(RetTy->getContext(), MinBWs[I]);		RetTy = IntegerType::get(RetTy->getContext(), MinBWs[I]);
VectorTy = ToVectorTy(RetTy, VF);		VectorTy = ToVectorTy(RetTy, VF);
auto SE = PSE.getSE();		auto SE = PSE.getSE();
... 116 lines not shown ...	case Instruction::FCmp: {
Instruction *Op0AsInstruction = dyn_cast<Instruction>(I->getOperand(0));		Instruction *Op0AsInstruction = dyn_cast<Instruction>(I->getOperand(0));
if (canTruncateToMinimalBitwidth(Op0AsInstruction, VF))		if (canTruncateToMinimalBitwidth(Op0AsInstruction, VF))
ValTy = IntegerType::get(ValTy->getContext(), MinBWs[Op0AsInstruction]);		ValTy = IntegerType::get(ValTy->getContext(), MinBWs[Op0AsInstruction]);
VectorTy = ToVectorTy(ValTy, VF);		VectorTy = ToVectorTy(ValTy, VF);
return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy);		return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy);
}		}
case Instruction::Store:		case Instruction::Store:
case Instruction::Load: {		case Instruction::Load: {
StoreInst *SI = dyn_cast<StoreInst>(I);		VectorTy = ToVectorTy(getMemInstValueType(I), VF);
LoadInst *LI = dyn_cast<LoadInst>(I);		return getMemoryInstructionCost(I, VF);
Type *ValTy = (SI ? SI->getValueOperand()->getType() : LI->getType());
VectorTy = ToVectorTy(ValTy, VF);

unsigned Alignment = SI ? SI->getAlignment() : LI->getAlignment();
unsigned AS =
SI ? SI->getPointerAddressSpace() : LI->getPointerAddressSpace();
Value *Ptr = getPointerOperand(I);
// We add the cost of address computation here instead of with the gep
// instruction because only here we know whether the operation is
// scalarized.
if (VF == 1)
return TTI.getAddressComputationCost(VectorTy) +
TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS);

if (LI && Legal->isUniform(Ptr)) {
// Scalar load + broadcast
unsigned Cost = TTI.getAddressComputationCost(ValTy->getScalarType());
Cost += TTI.getMemoryOpCost(I->getOpcode(), ValTy->getScalarType(),
Alignment, AS);
return Cost +
TTI.getShuffleCost(TargetTransformInfo::SK_Broadcast, ValTy);
}

// For an interleaved access, calculate the total cost of the whole
// interleave group.
if (Legal->isAccessInterleaved(I)) {
auto Group = Legal->getInterleavedAccessGroup(I);
assert(Group && "Fail to get an interleaved access group.");

// Only calculate the cost once at the insert position.
if (Group->getInsertPos() != I)
return 0;

unsigned InterleaveFactor = Group->getFactor();
Type *WideVecTy =
VectorType::get(VectorTy->getVectorElementType(),
VectorTy->getVectorNumElements() * InterleaveFactor);

// Holds the indices of existing members in an interleaved load group.
// An interleaved store group doesn't need this as it doesn't allow gaps.
SmallVector<unsigned, 4> Indices;
if (LI) {
for (unsigned i = 0; i < InterleaveFactor; i++)
if (Group->getMember(i))
Indices.push_back(i);
}

// Calculate the cost of the whole interleaved group.
unsigned Cost = TTI.getInterleavedMemoryOpCost(
I->getOpcode(), WideVecTy, Group->getFactor(), Indices,
Group->getAlignment(), AS);

if (Group->isReverse())
Cost +=
Group->getNumMembers() *
TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0);

// FIXME: The interleaved load group with a huge gap could be even more
// expensive than scalar operations. Then we could ignore such group and
// use scalar operations instead.
return Cost;
}

// Check if the memory instruction will be scalarized.
if (Legal->memoryInstructionMustBeScalarized(I, VF)) {
unsigned Cost = 0;
Type *PtrTy = ToVectorTy(Ptr->getType(), VF);

// Figure out whether the access is strided and get the stride value
// if it's known in compile time
const SCEV *PtrSCEV = getAddressAccessSCEV(Ptr, Legal, SE, TheLoop);

// Get the cost of the scalar memory instruction and address computation.
Cost += VF * TTI.getAddressComputationCost(PtrTy, SE, PtrSCEV);
Cost += VF *
TTI.getMemoryOpCost(I->getOpcode(), ValTy->getScalarType(),
Alignment, AS);

// Get the overhead of the extractelement and insertelement instructions
// we might create due to scalarization.
Cost += getScalarizationOverhead(I, VF, TTI);

// If we have a predicated store, it may not be executed for each vector
// lane. Scale the cost by the probability of executing the predicated
// block.
if (Legal->isScalarWithPredication(I))
Cost /= getReciprocalPredBlockProb();

return Cost;
}

// Determine if the pointer operand of the access is either consecutive or
// reverse consecutive.
int ConsecutiveStride = Legal->isConsecutivePtr(Ptr);
bool Reverse = ConsecutiveStride < 0;

// Determine if either a gather or scatter operation is legal.
bool UseGatherOrScatter =
!ConsecutiveStride && Legal->isLegalGatherOrScatter(I);

unsigned Cost = TTI.getAddressComputationCost(VectorTy);
if (UseGatherOrScatter) {
assert(ConsecutiveStride == 0 &&
"Gather/Scatter are not used for consecutive stride");
return Cost +
TTI.getGatherScatterOpCost(I->getOpcode(), VectorTy, Ptr,
Legal->isMaskRequired(I), Alignment);
}
// Wide load/stores.
if (Legal->isMaskRequired(I))
Cost +=
TTI.getMaskedMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS);
else
Cost += TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS);

if (Reverse)
Cost += TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0);
return Cost;
}		}
case Instruction::ZExt:		case Instruction::ZExt:
case Instruction::SExt:		case Instruction::SExt:
case Instruction::FPToUI:		case Instruction::FPToUI:
case Instruction::FPToSI:		case Instruction::FPToSI:
case Instruction::FPExt:		case Instruction::FPExt:
case Instruction::PtrToInt:		case Instruction::PtrToInt:
case Instruction::IntToPtr:		case Instruction::IntToPtr:
... 86 lines not shown ...	void LoopVectorizationCostModel::collectValuesToIgnore() {

// Ignore type-promoting instructions we identified during reduction		// Ignore type-promoting instructions we identified during reduction
// detection.		// detection.
for (auto &Reduction : *Legal->getReductionVars()) {		for (auto &Reduction : *Legal->getReductionVars()) {
RecurrenceDescriptor &RedDes = Reduction.second;		RecurrenceDescriptor &RedDes = Reduction.second;
SmallPtrSetImpl<Instruction *> &Casts = RedDes.getCastInsts();		SmallPtrSetImpl<Instruction *> &Casts = RedDes.getCastInsts();
VecValuesToIgnore.insert(Casts.begin(), Casts.end());		VecValuesToIgnore.insert(Casts.begin(), Casts.end());
}		}

// Insert values known to be scalar into VecValuesToIgnore. This is a
// conservative estimation of the values that will later be scalarized.
//
// FIXME: Even though an instruction is not scalar-after-vectorization, it may
// still be scalarized. For example, we may find an instruction to be
// more profitable for a given vectorization factor if it were to be
// scalarized. But at this point, we haven't yet computed the
// vectorization factor.
for (auto *BB : TheLoop->getBlocks())
for (auto &I : *BB)
if (Legal->isScalarAfterVectorization(&I))
VecValuesToIgnore.insert(&I);
}		}

void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,		void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,
bool IfPredicateInstr) {		bool IfPredicateInstr) {
assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");		assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");
// Holds vector parameters or scalars, in case of uniform vals.		// Holds vector parameters or scalars, in case of uniform vals.
SmallVector<VectorParts, 4> Params;		SmallVector<VectorParts, 4> Params;

... 461 lines not shown ...
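
The three-way choice made at the end of setCostBasedWideningDecision above can be tried in isolation. The following is only a minimal sketch: the `Widening` enum, the `Choice` struct and the `chooseWidening` function are hypothetical stand-ins and not part of the patch; only the comparison itself is copied from the diff. As in the patch, `UINT_MAX` marks an option that is not available.

```cpp
#include <cassert>
#include <climits>

// Hypothetical stand-in for the cost model's InstWidening values.
enum class Widening { Interleave, GatherScatter, Scalarize };

// Hypothetical pair of (decision, cost) recorded for one access or group.
struct Choice {
  Widening Decision;
  unsigned Cost;
};

// Same comparison as in the patch: interleaving wins a tie against
// gather/scatter but must be strictly cheaper than scalarization;
// gather/scatter must be strictly cheaper than scalarization;
// otherwise the access is scalarized.
Choice chooseWidening(unsigned InterleaveCost, unsigned GatherScatterCost,
                      unsigned ScalarizationCost) {
  if (InterleaveCost <= GatherScatterCost &&
      InterleaveCost < ScalarizationCost)
    return {Widening::Interleave, InterleaveCost};
  if (GatherScatterCost < ScalarizationCost)
    return {Widening::GatherScatter, GatherScatterCost};
  return {Widening::Scalarize, ScalarizationCost};
}
```

With an expensive interleave group and a cheap legal gather, for example `chooseWidening(100, 10, 50)`, the sketch picks gather/scatter, which is the situation the summary describes. When neither interleaving nor gather/scatter is available (both `UINT_MAX`), it falls through to scalarization.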

llvm/trunk/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll

				; REQUIRES: asserts
				; RUN: opt < %s -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -S --debug-only=loop-vectorize 2>&1 | FileCheck %s

				; This test shows extremely high interleaving cost that, probably, should be fixed.
				; Due to the high cost, interleaving is not beneficial and the cost model chooses to scalarize
				; the load instructions.

				target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64--linux-gnu"

				%pair = type { i8, i8 }

				; CHECK-LABEL: test
				; CHECK: Found an estimated cost of 20 for VF 2 For instruction: {{.*}} load i8
				; CHECK: Found an estimated cost of 0 for VF 2 For instruction: {{.*}} load i8
				; CHECK: vector.body
				; CHECK: load i8
				; CHECK: load i8
				; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body

				define void @test(%pair* %p, i64 %n) {
				entry:
				br label %for.body

				for.body:
				%i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
				%tmp0 = getelementptr %pair, %pair* %p, i64 %i, i32 0
				%tmp1 = load i8, i8* %tmp0, align 1
				%tmp2 = getelementptr %pair, %pair* %p, i64 %i, i32 1
				%tmp3 = load i8, i8* %tmp2, align 1
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp eq i64 %i.next, %n
				br i1 %cond, label %for.end, label %for.body

				for.end:
				ret void
				}

llvm/trunk/test/Transforms/LoopVectorize/X86/consecutive-ptr-uniforms.ll

	... 12 lines not shown ...
	; scatter operation. %tmp3 (and the induction variable) should not be marked			; scatter operation. %tmp3 (and the induction variable) should not be marked
	; uniform-after-vectorization.			; uniform-after-vectorization.
	;			;
	; CHECK: LV: Found uniform instruction: %tmp0 = getelementptr inbounds %data, %data* %d, i64 0, i32 3, i64 %i			; CHECK: LV: Found uniform instruction: %tmp0 = getelementptr inbounds %data, %data* %d, i64 0, i32 3, i64 %i
	; CHECK-NOT: LV: Found uniform instruction: %tmp3 = getelementptr inbounds %data, %data* %d, i64 0, i32 0, i64 %i			; CHECK-NOT: LV: Found uniform instruction: %tmp3 = getelementptr inbounds %data, %data* %d, i64 0, i32 0, i64 %i
	; CHECK-NOT: LV: Found uniform instruction: %i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]			; CHECK-NOT: LV: Found uniform instruction: %i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
	; CHECK-NOT: LV: Found uniform instruction: %i.next = add nuw nsw i64 %i, 5			; CHECK-NOT: LV: Found uniform instruction: %i.next = add nuw nsw i64 %i, 5
	; CHECK: vector.body:			; CHECK: vector.body:
				; CHECK: %index = phi i64
	; CHECK: %vec.ind = phi <16 x i64>			; CHECK: %vec.ind = phi <16 x i64>
	; CHECK: %[[T0:.+]] = extractelement <16 x i64> %vec.ind, i32 0			; CHECK: %[[T0:.+]] = mul i64 %index, 5
	; CHECK: %[[T1:.+]] = getelementptr inbounds %data, %data* %d, i64 0, i32 3, i64 %[[T0]]			; CHECK: %[[T1:.+]] = getelementptr inbounds %data, %data* %d, i64 0, i32 3, i64 %[[T0]]
	; CHECK: %[[T2:.+]] = bitcast float* %[[T1]] to <80 x float>*			; CHECK: %[[T2:.+]] = bitcast float* %[[T1]] to <80 x float>*
	; CHECK: load <80 x float>, <80 x float>* %[[T2]], align 4			; CHECK: load <80 x float>, <80 x float>* %[[T2]], align 4
	; CHECK: %[[T3:.+]] = getelementptr inbounds %data, %data* %d, i64 0, i32 0, i64 %[[T0]]			; CHECK: %[[T3:.+]] = getelementptr inbounds %data, %data* %d, i64 0, i32 0, i64 %[[T0]]
	; CHECK: %[[T4:.+]] = bitcast float* %[[T3]] to <80 x float>*			; CHECK: %[[T4:.+]] = bitcast float* %[[T3]] to <80 x float>*
	; CHECK: load <80 x float>, <80 x float>* %[[T4]], align 4			; CHECK: load <80 x float>, <80 x float>* %[[T4]], align 4
	; CHECK: %VectorGep = getelementptr inbounds %data, %data* %d, i64 0, i32 0, <16 x i64> %vec.ind			; CHECK: %VectorGep = getelementptr inbounds %data, %data* %d, i64 0, i32 0, <16 x i64> %vec.ind
	; CHECK: call void @llvm.masked.scatter.v16f32({{.*}}, <16 x float*> %VectorGep, {{.*}})			; CHECK: call void @llvm.masked.scatter.v16f32({{.*}}, <16 x float*> %VectorGep, {{.*}})
	... 26 lines not shown ...
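
The `<80 x float>` loads in the CHECK lines above are consistent with the wide type that getInterleaveGroupCost builds, `VF * InterleaveFactor` elements. The sketch below only redoes that arithmetic. It assumes VF = 16 (read off the `<16 x i64>` induction vector) and an interleave factor of 5 (read off the induction step of 5); the helper names are hypothetical.

```cpp
#include <cassert>

// Elements in the wide vector accessed for an interleave group: VF lanes
// for each of the Factor slots, whether or not a slot has a member.
constexpr unsigned wideVectorElements(unsigned VF, unsigned Factor) {
  return VF * Factor;
}

// Elements the loop actually consumes: one per lane for each existing
// member of the group (NumMembers is an illustrative parameter).
constexpr unsigned usedElements(unsigned VF, unsigned NumMembers) {
  return VF * NumMembers;
}
```

Under these assumptions `wideVectorElements(16, 5)` is 80, while a group with a single member would consume only `usedElements(16, 1)` = 16 of those elements. The gap between the two grows with the factor, which is why a gather can become the cheaper option.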

llvm/trunk/test/Transforms/LoopVectorize/X86/gather-vs-interleave.ll

				; RUN: opt -loop-vectorize -S -mcpu=skylake-avx512 < %s | FileCheck %s

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				; This test checks that the "gather" operation is chosen since its cost is lower
				; than that of the interleaving pattern.
				;
				;unsigned long A[SIZE];
				;unsigned long B[SIZE];
				;
				;void foo() {
				; for (int i=0; i<N; i+=8) {
				; B[i] = A[i] + 5;
				; }
				;}

				@A = global [10240 x i64] zeroinitializer, align 16
				@B = global [10240 x i64] zeroinitializer, align 16


				; CHECK-LABEL: strided_load_i64
				; CHECK: masked.gather
				define void @strided_load_i64() {
				br label %1

				; <label>:1: ; preds = %0, %1
				%indvars.iv = phi i64 [ 0, %0 ], [ %indvars.iv.next, %1 ]
				%2 = getelementptr inbounds [10240 x i64], [10240 x i64]* @A, i64 0, i64 %indvars.iv
				%3 = load i64, i64* %2, align 16
				%4 = add i64 %3, 5
				%5 = getelementptr inbounds [10240 x i64], [10240 x i64]* @B, i64 0, i64 %indvars.iv
				store i64 %4, i64* %5, align 16
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 8
				%6 = icmp slt i64 %indvars.iv.next, 1024
				br i1 %6, label %1, label %7

				; <label>:7: ; preds = %1
				ret void
				}
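
As a rough illustration of what the `masked.gather` checked above does for the stride-8 loop in the test comment, the sketch below emulates one gathered vector iteration with scalar code. It is not how the vectorizer emits code: the vector width of 8 is assumed for illustration only, and `gatherStrided` is a hypothetical helper.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t VF = 8;     // assumed vector width, illustration only
constexpr std::size_t Stride = 8; // from the i += 8 loop in the test comment

// Scalar emulation of one gathered vector iteration: lane L reads
// A[(FirstIter + L) * Stride], so the VF loaded elements are not adjacent.
std::array<unsigned long, VF> gatherStrided(const std::vector<unsigned long> &A,
                                            std::size_t FirstIter) {
  std::array<unsigned long, VF> Lanes{};
  for (std::size_t L = 0; L < VF; ++L)
    Lanes[L] = A[(FirstIter + L) * Stride];
  return Lanes;
}
```

Each call touches exactly VF elements, whereas an interleaved access with the same stride would load every element in between as well.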

Revision Contents

Diff 87683

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp

llvm/trunk/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll

llvm/trunk/test/Transforms/LoopVectorize/X86/consecutive-ptr-uniforms.ll

llvm/trunk/test/Transforms/LoopVectorize/X86/gather-vs-interleave.ll
