This is an archive of the discontinued LLVM Phabricator instance.

[Loop Vectorizer] Interleave vs Gather - in some cases Gather is better.
ClosedPublic

Authored by delena on Dec 19 2016, 7:10 AM.

Details

Reviewers

anemet
Ayal
mkuper
mssimpso

Commits

rG5267edd3e390: [Loop Vectorizer] Cost-based decision for vectorization form of memory…
rL294503: [Loop Vectorizer] Cost-based decision for vectorization form of memory…

Summary

The bug is described in PR31426.
The cost of a Load instruction is calculated in the following order: isConsecutive - isInterleave - isGather - scalar.
When a Load instruction belongs to an Interleave group, the "Gather" option is not checked at all. But when the interleave factor exceeds the maximum, the interleave cost is high and "Gather" is preferable. The following loop is not vectorized on AVX-512 due to this bug:
for (i=0; i<N; ++i)
  B[i] = A[i*5]
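
For reference, the loop above can be written as a compilable function. This is only a sketch of the access pattern: the name `copyStrided` and the use of `std::vector` are assumptions made for illustration and do not come from the patch. Each iteration reads one element out of every five, which is the strided pattern the summary describes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the patch): B[i] = A[i*5] for i in [0, N).
std::vector<int> copyStrided(const std::vector<int> &A, std::size_t N) {
  // The last index read is (N - 1) * 5, so it must be inside A.
  assert(N == 0 || (N - 1) * 5 < A.size());
  std::vector<int> B(N);
  for (std::size_t i = 0; i < N; ++i)
    B[i] = A[i * 5]; // stride-5 load, consecutive store
  return B;
}
```

Calling `copyStrided` on a vector holding 0..19 with `N = 4` reads the elements at indices 0, 5, 10 and 15.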

Diff Detail

Repository: rL LLVM

Event Timeline

delena updated this revision to Diff 81947.Dec 19 2016, 7:10 AM

delena retitled this revision from to [Loop Vectorizer] Interleave vs Gather - in some cases Gather is better..

delena updated this object.

delena added reviewers: mkuper, Ayal, anemet.

delena set the repository for this revision to rL LLVM.

delena added a subscriber: llvm-commits.

Herald added a subscriber: mzolotukhin. Dec 19 2016, 7:10 AM

Added some comments.

mssimpso added a subscriber: mssimpso.Dec 19 2016, 7:44 AM

mssimpso added inline comments.Dec 19 2016, 9:23 AM

../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7047–7049 ↗	(On Diff #81949)	Why don't you just compare the costs? You wouldn't need to make this assumption anymore.

mkuper added inline comments.Dec 19 2016, 10:11 AM

../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7052 ↗	(On Diff #81949)	Aren't we already checking this in selectInterleaveCount()? How do we end up with interleave factors above MaxInterleaveFactor in the first place?

mssimpso added inline comments.Dec 19 2016, 10:25 AM

../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7052 ↗	(On Diff #81949)	Ah, our naming conventions have confused this somewhat, I think. TTI.getMaxInterleaveFactor is the hook for the max unroll factor ("interleaving"). I think Elena wanted TLI.getMaxSupportedInterleaveFactor instead. This is the hook for determining the max factor of the interleaved access groups.

mkuper added inline comments.Dec 19 2016, 10:31 AM

../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7052 ↗	(On Diff #81949)	Argh, right. Sorry for the stupid question, I just looked at this and went "this looks odd" without actually reading the context. But yes, Elena, you want the other function. Regardless, can we maybe change the naming to something sane? As a separate patch, of course. (I don't have any good ideas, though.)

delena added inline comments.Dec 20 2016, 6:04 AM

../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7047–7049 ↗	(On Diff #81949)	The cost that we provide for interleaved access is incorrect, especially for AVX-512. AVX-512 has 3-src shuffles and the real cost is much lower. I can't compare it to Gather: the Gather cost wins today, even for a small stride, but that does not reflect reality. So there are 2 bugs: (1) the loop is scalarized and the Gather/Scatter option is not considered at all; (2) the cost for interleaving is incorrect. I can start by providing a correct cost for interleaving on AVX-512, or I can fix (1) first. I'll retrieve the proper "MaxSupportedInterleaveFactor". What do you think?

mkuper added inline comments.Dec 20 2016, 11:13 AM

../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7047–7049 ↗	(On Diff #81949)	I think it would be better to fix the cost model first. It's very pessimistic for x86 in general, not just AVX-512, but you're right, it's even worse for AVX-512, because the real cost is lower. But I thought Farhana was already working on that. Am I confused?

Now that the interleaving cost calculation is correct, I compare the cost of 3 possible options (interleave, gather and scalar) and choose the best one.
The decision made by the cost model should be saved and later used when we generate the vector code.

The code is refactored in order to allow this comparison, but the comparison is the *only* functional change in the code. The rest of the logic remains the same.
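
The comparison described here can be sketched roughly as follows. This is a standalone illustration, not the code from the patch: `Widening`, `DecisionTable`, the integer instruction ids and the cost arguments are all hypothetical stand-ins, and the way ties are resolved (scalarization first, then interleaving) is simply a choice made in this sketch.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical stand-in for the decision enum discussed in the review.
enum class Widening { None, Interleave, GatherScatter, Scalarize };

struct DecisionTable {
  // Key: (instruction id, VF). The real code would key on Instruction *.
  std::map<std::pair<int, unsigned>, Widening> Decisions;

  // Pick the cheapest of the three options and remember the choice.
  Widening decide(int Inst, unsigned VF, unsigned InterleaveCost,
                  unsigned GatherCost, unsigned ScalarCost) {
    assert(VF >= 2 && "decisions are only recorded for vector VFs");
    Widening W = Widening::Scalarize;
    unsigned Best = ScalarCost;
    if (InterleaveCost < Best) {
      W = Widening::Interleave;
      Best = InterleaveCost;
    }
    if (GatherCost < Best) {
      W = Widening::GatherScatter;
      Best = GatherCost;
    }
    Decisions[{Inst, VF}] = W;
    return W;
  }

  // Code generation can later ask what was decided.
  // Widening::None means no decision was recorded for this pair.
  Widening lookup(int Inst, unsigned VF) const {
    auto It = Decisions.find({Inst, VF});
    return It == Decisions.end() ? Widening::None : It->second;
  }
};
```

With costs of 20 (interleave), 8 (gather) and 40 (scalar), `decide` records `GatherScatter`, and a later `lookup` for the same instruction and VF returns that value without recomputing anything.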

I added a test case that demonstrates gather - vs - interleave case.

Merged the patch with the latest changes in LV.

Ping*

Hi Elena,

This patch causes a crash in spec2006/povray on AArch64. I've pasted a test case over at P7951. The problem has to do with the analysis in collectLoopUniforms and the new decision to scalarize. collectLoopUniforms is very conservative about what instructions remain uniform after vectorization. If a memory access has the possibility of being scalarized (even though it may not be), its pointer operand is not marked uniform. What's happening here is that you've introduced a new scalarization decision based on the cost model that collectLoopUniforms doesn't know about (and likely can't know about). In the test case, collectLoopUniforms marks the pointer operands of the loads uniform even though the loads are now scalarized, which causes the pointer operands to be non-uniform.

I'm not really sure what the best way to fix this would be. Any thoughts? One way would be to remove scalarization from your widening decision.

mssimpso added a reviewer: mssimpso.Jan 12 2017, 10:49 AM

mssimpso removed a subscriber: mssimpso.

In D27919#644255, @mssimpso wrote:

Hi Elena,

I'm not really sure what the best way to fix this would be. Any thoughts? One way would be to remove scalarization from your widening decision.

That sounds a bit unfortunate. I think we'd rather like the cost model to be able to say "please scalarize this".

../lib/Transforms/Vectorize/LoopVectorize.cpp
1711	I'm not a fan of the name of this enum, but don't really have any good ideas. Anyone else?
1734	This looks like a suboptimal way to iterate over InterleaveGroup, because it requires a lookup for every index, instead of just iterating over the Members collection. (We don't care about the order here, right?) Perhaps add a better interface?
1740	Is this really a per-instruction thing, or do we just always end up with LV_NONE when the cost model is not in use (e.g. explicit user-provided VF). If the former, when does this happen? If the latter, then I think the comment is a bit misleading - and it would be good to be able to assert on this happening only when there's no cost model. Edit: Oh, I think I see, you care about temporary LV_NONE values for Interleave groups. I still think it'd be nice if we could somehow distinguish between the cases, so we could assert on not having a cost when we should.
2883	Extra space after =.
2897	assert(!Legal->memoryInstructionMustBeScalarized(Instr, VF)), maybe? Because now we leave the door open for the cost model decision being to widen, even though memoryInstructionMustBeScalarized(). Another option is to replace the if with: if (Decision == LoopVectorizationLegality::LV_SCALARIZE || Legal->memoryInstructionMustBeScalarized(Instr, VF)) But I'm not sure it makes sense, because there's no good reason to always call memoryInstructionMustBeScalarized() in a non-asserts build - in theory, the cost model should have already checked this.
7216	Wouldn't you calculate it several times per group if it's unprofitable?
../test/Analysis/CostModel/X86/interleave-load-i32.ll
13	What happened to the VF = 1 cost?

In D27919#644255, @mssimpso wrote:

Hi Elena,

This patch causes a crash in spec2006/povray on AArch64. I've pasted a test case over at P7951. The problem has to do with the analysis in collectLoopUniforms and the new decision to scalarize. collectLoopUniforms is very conservative about what instructions remain uniform after vectorization. If a memory access has the possibility of being scalarized (even though it may not be), its pointer operand is not marked uniform. What's happening here is that you've introduced a new scalarization decision based on the cost model that collectLoopUniforms doesn't know about (and likely can't know about). In the test case, collectLoopUniforms marks the pointer operands of the loads uniform even though the loads are now scalarized, which causes the pointer operands to be non-uniform.

I'm not really sure what the best way to fix this would be. Any thoughts? One way would be to remove scalarization from your widening decision.

I'm trying to revisit collections of Uniforms and Scalars after cost estimation and remove GEPs and induction variables if we decided to scalarize. Not finished yet..

../lib/Transforms/Vectorize/LoopVectorize.cpp
1711	CM_DECISION_NONE, CM_DECISION_WIDEN .. ?
1734	I agree, but iteration inside Factor is not a big overhead and it is used in at least one more place. This patch is complex enough; I'd postpone unrelated changes to another patch.
2897	The form I used just prevents a redundant call to memoryInstructionMustBeScalarized(). Because we may have a decision (INTERLEAVE, WIDEN or GATHER_SCATTER) at this stage.
7216	LV_NONE means "not calculated yet". If we are here, each memory instruction will have " a decision".
../test/Analysis/CostModel/X86/interleave-load-i32.ll
13	There is no group for VF 1. Instruction cost is still there.

In D27919#647975, @delena wrote:

I'm trying to revisit collections of Uniforms and Scalars after cost estimation and remove GEPs and induction variables if we decided to scalarize. Not finished yet..

That sounds like a good idea to me, and I hope we can separate the uniform/scalar collection from the cost estimation in a way that makes sense. When I looked at doing that a while back, the complication I ran into was that the cost estimates depend on knowing the uniforms. But if we use the cost estimates to help determine the uniforms, then we end up with a weird circular dependence. If we mark something uniform/scalar after computing the costs, we will need to update the costs somehow. That in turn might change what we would like to scalarize, etc.

I'm revisiting the set of Scalars and Uniforms after cost modeling. Taking into account that the cost model uses Uniforms and the Uniforms are changed after the cost modeling, I still think that there is no circular dependency here.
(1) I remove uniform GEP (and the corresponding induction) when we decide to scalarize memory instruction.
(2) I do not remove GEP from scalars if it will not be used in Gather/Scatter.

I added a test that Matthew sent me.

mssimpso added inline comments.Jan 18 2017, 9:24 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009	Hi Elena, I had been thinking about the use of isUniformAfterVectorization() here in getInstructionCost(). Wouldn't it now be possible for the set of uniforms to differ from the first collection (before VF selection) and the second collection (after VF selection)? So we would choose a VF based on costs assuming an instruction may or may not be uniform. Then we could later reverse our initial decision about the instruction's uniformity after VF selection, making the total cost on which we based our VF decision inaccurate. Or am I missing something? I haven't yet thought through the implications of this in enough detail to know whether this would matter much or not.

delena added inline comments.Jan 19 2017, 4:40 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009	About the list of Uniforms. We insert and then remove only GEPs and Induction variables. We do not calculate cost for them anyway. All other Uniform values stay in place. So, the cost is accurate at the end. There is no circular dependency here.

mssimpso added inline comments.Jan 19 2017, 5:18 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009	I don't think this is true in general. We mark an instruction uniform if all its users are uniform. So for example, if we have a uniform GEP whose index is some computation, that computation is also uniform if it's only used by the GEP. I think we have some examples in induction.ll, but something like this:
  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
  %sum = add i64 %i, %x
  %idx = getelementptr inbounds float, float* %a, i64 %sum
  load float, float* %idx, align 4
The GEP is consecutive, so it will be marked uniform. %sum will also be marked uniform because it's only used by the GEP. If we later decide to scalarize the load, the GEP, the IV, and %sum will all no longer be uniform. So the cost for %sum will have been wrong.
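
The rule "an instruction is uniform if all its users are uniform" can be illustrated with a toy fixed-point computation. This is a sketch under stated assumptions, not the vectorizer's algorithm: values are plain strings, the def-use graph only models %sum and %idx from the IR in this comment, and the GEP is seeded as uniform only when the load is assumed to be widened. The names `collectUniforms` and `uniformsForExample` are hypothetical.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy def-use graph: each value maps to the values that use it.
using Users = std::map<std::string, std::vector<std::string>>;

// Start from the seed values and keep adding any value whose users are
// all already in the uniform set, until nothing changes.
std::set<std::string> collectUniforms(const Users &G,
                                      const std::set<std::string> &Seeds) {
  std::set<std::string> Uniform = Seeds;
  bool Changed = true;
  while (Changed) {
    Changed = false;
    for (const auto &Entry : G) {
      const std::string &V = Entry.first;
      const std::vector<std::string> &Us = Entry.second;
      if (Uniform.count(V) || Us.empty())
        continue;
      bool AllUsersUniform = true;
      for (const std::string &U : Us)
        if (!Uniform.count(U)) {
          AllUsersUniform = false;
          break;
        }
      if (AllUsersUniform) {
        Uniform.insert(V);
        Changed = true;
      }
    }
  }
  return Uniform;
}

// %sum is used only by %idx (the GEP); %idx is used only by the load.
// Assumption for the sketch: the GEP is a seed only if the load is widened.
std::set<std::string> uniformsForExample(bool LoadIsWidened) {
  Users G = {{"%sum", {"%idx"}}, {"%idx", {"load"}}};
  std::set<std::string> Seeds;
  if (LoadIsWidened)
    Seeds.insert("%idx");
  return collectUniforms(G, Seeds);
}
```

With the load widened, both %idx and %sum end up in the set; with the load scalarized, the set is empty, so any cost computed for %sum under the first assumption no longer applies.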

mssimpso added inline comments.Jan 19 2017, 6:19 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009	Just a thought - why not recompute and cache the uniforms (and possibly scalars) for each VF we compute costs for? That would avoid any potential logical inconsistencies. I think the compile-time overhead would probably be minimal (and you're already computing these sets twice anyway).

delena added inline comments.Jan 19 2017, 7:18 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009	Just talked with Ayal about this. I can collect Uniforms after making the decision about Load/Store instructions, and the decision is based on cost. The decision affects other instructions inside the loop, as you pointed out before. Theoretically, if I have N variants of representing all memory instructions inside the loop, I should examine 2**N combinations per VF. Ayal proposed the following sequence, which should be done at the CM stage, after legality is finished. Per VF:
  (1) Go through all memory insts and make the CM decision.
  (2) Build Uniforms and Scalars per VF (that's what you say now).
  (3) Calculate the cost for the VF, based on Uniforms and Scalars.
It is still not ideal, but probably better than what we have.

mssimpso added inline comments.Jan 19 2017, 7:44 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009	Ayal's sequence makes sense to me. We should probably also try to move Uniforms/Scalars and related functions over to LoopVectorizationCostModel rather than keep them in LoopVectorizationLegality, as they will now be a function of Cost/VF and would be more appropriate there in my opinion. This could probably be a separate patch.

I moved Uniforms and Scalars from Legality to the Cost Model. Now we collect Uniforms and Scalars per VF and these collections depend on widening decisions that CM takes for Load/Store instructions.

The patch is big, but I did not see how to split it into separate patches.

delena added a subscriber: dorit.Jan 26 2017, 12:21 AM

mkuper mentioned this in D28975: [LV] Introducing VPlan to model the vectorized code and drive its transformation.Jan 26 2017, 4:50 PM

Ping *
We are in sync with Ayal and Gil working on VPlan.

Hi Elena,

I'll take a look at this again today. Thanks for the reminder!

Hi Elena,

Thanks for your patience. I haven't yet looked in detail at the widening decision selection, but here are some comments around the uniforms/scalars.

Matt.

../lib/Transforms/Vectorize/LoopVectorize.cpp
5614–5615	Can we change this to something like: if (Uniforms.count(VF)) return; auto &UniformsVF = Uniforms[VF]; We want to distinguish the case that (1) Uniforms have not been computed for VF from (2) Uniforms have been computed for VF but there aren't any, so we don't need to compute them again. We can end up calling this twice for the same VF if we have a user-selected VF and then compute the expected cost for interleaving. This is similar to the way we do this check in collectInstsToScalarize. This will also apply to collectLoopScalars.
5616–5619	This is not true now, right? An interleaved access may be scalarized based on the cost model.
5693	Why not roll this check into memoryInstructionMustBeScalarized? Either way, I think this check and the check below in isVectorizedMemAccessUse that calls memoryInstructionMustBeScalarized, should be the same.
7008	The VF > 1 check is not needed because you check that condition in isUniformAfterVectorization.
7023	This should be VF == 1, since we can't have a zero VF.
7373–7385	Can we just delete this in favor of a helper function that checks VecValuesToIgnore and IsScalarAfterVectorization for a given VF? Something like: bool LoopVectorizationCostModel::shouldIgnoreVecValueInCostModel(Instruction *I, unsigned VF) { return VecValuesToIgnore.count(I) || isScalarAfterVectorization(I, VF); } This way we won't have to be imprecise.
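
The check suggested in the 5614–5615 comment above relies on the difference between "no entry for this VF" and "an entry that happens to be empty". A minimal sketch of that idiom, with `std::map` and integer instruction ids standing in for the real containers (the struct, the counter and the placeholder analysis are all hypothetical):

```cpp
#include <cassert>
#include <map>
#include <set>

struct UniformCache {
  std::map<unsigned, std::set<int>> Uniforms; // VF -> uniform instruction ids
  unsigned Computations = 0;                  // counts real analyses (demo only)

  void collect(unsigned VF) {
    assert(VF >= 2 && "uniforms are only collected for vector VFs");
    // A present key means "already analysed", even if the set is empty.
    if (Uniforms.count(VF))
      return;
    auto &UniformsVF = Uniforms[VF]; // creates the (possibly empty) entry
    ++Computations;
    // Placeholder analysis (assumption): pretend only VF == 4 has a uniform.
    if (VF == 4)
      UniformsVF.insert(42);
  }
};
```

Calling `collect(8)` twice runs the analysis once: the second call finds the empty entry created by the first and returns early.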

delena marked 3 inline comments as done.Jan 31 2017, 7:22 AM

delena added inline comments.

../lib/Transforms/Vectorize/LoopVectorize.cpp
5614–5615	Yes. I've changed.
5616–5619	It is still true. An instruction must be scalarized if there is no other option. This is a "legality" check; the cost model comes later.
5693	Yes, I don't need to check legality any more. It is already implied in CM decision.
7008	Yes. You are right, thanks!
7373–7385	Yes, it is possible.

Updated, following Matthew's comments.

Hi Elena,

Here are some more inline comments. Also, please clang-format the patch if you haven't already done so to make the review easier. Thanks!

../lib/Transforms/Vectorize/LoopVectorize.cpp
1985	How are we using CM_DECISION_NONE? Aren't we forced to make some sort of decision? I would think the default would be CM_DECISION_WIDEN unless the cost model points to one of the others.
2062–2063	The VF > 2 comment can be removed now.
2070–2071	The VF > 2 comment can be removed now.
5496–5497	Since you've added a new check in collectUniformsAndScalars to ensure the analysis is performed only once, the check here and in collectLoopUniforms is redundant. Should these be asserts now?
5558–5563	I don't think I see a real use for this function anymore. Please see my related comment about memoryAccessMustBeScalarized. This function was only ever used in collectLoopUniforms to help determine how a memory access would be vectorized. I think you can probably greatly simplify the logic in collectLoopUniforms and remove it. Now, we know what the vectorizer will do based on the cost model decision. For a GEP to remain uniform, I think we just need to know that all its users are CM_DECISION_INTERLEAVE or CM_DECISION_WIDEN. Is this right? If so, please work it into the GEP part of collectLoopUniforms.
5616–5619	I think the cost vs legality distinction is not important here. My original intent with this function was just to consolidate all the scalarization conditions for a given access. That way it could be called when collecting uniforms, computing costs, and vectorizing and they all would agree on what would happen. Perhaps the choice of name was misleading? In any case, unless I'm missing something, I don't think you actually use this function anymore. You've replaced all uses with getWideningDecision(I, VF) == CM_DECISION_SCALARIZE This makes sense because I think you've moved all the non-cost related scalarization decisions into the inverse: memoryInstructionCanBeWidened. Can this function be deleted now?

In D27919#662246, @mssimpso wrote:

Hi Elena,

Here are some more inline comments. Also, please clang-format the patch if you haven't already done so to make the review easier. Thanks!

Done.

../lib/Transforms/Vectorize/LoopVectorize.cpp
1985	I used it while capturing decisions for an interleave group. At this point I can get rid of it. I return "NONE" if there is no decision, and it is convenient. InstWidening getWideningDecision(Instruction *I, unsigned VF) { assert(VF >= 2 && "Expected VF >=2"); std::pair<Instruction *, unsigned> InstOnVF = std::make_pair(I, VF); if (!WideningDecisions.count(InstOnVF)) return CM_DECISION_NONE; return WideningDecisions[InstOnVF]; } Then I use it in an assertion, to make sure that a decision is made.
5496–5497	Maybe, until somebody decides to call these functions separately. Right now I don't see a reason. I'll put an "assert" and add a comment.
5558–5563	I removed mustBeScalarized(). (I suppose you meant this function, the lines are mixed up)

delena updated this revision to Diff 86636.Feb 1 2017, 7:44 AM

mssimpso added inline comments.Feb 1 2017, 7:54 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
5558–5563	In this comment I was talking about hasConsecutiveLikePtrOperand. I don't think it's needed anymore and can probably be deleted. It's only used in collectLoopUniforms to help determine if a GEP will remain uniform. But can't we now determine that much easier by just checking the CM_DECISION of the accesses? Does that make sense?

I removed hasConsecutiveLikePtrOperand and simplified the code.

Hi Elena,

I like the direction this patch is going. Thanks for all your work. Here are some more comments inline.

../lib/Transforms/Vectorize/LoopVectorize.cpp
7069	Can we avoid the divisions here and below and use multiplication instead? We won't have any round-off issues that way. What about something like: unsigned Accesses = 1; if (Legal->isAccessInterleaved(&I)) { Accesses = Group->getNumMembers(); ... } ... ScalarizationCost *= Accesses; ...
7088–7096	I think we're fairly consistent in other places where we compare costs, that we prefer the scalar version if there's no benefit for vectorization. So if the scalarization cost is <= the other costs, I think we should go with that.
7180–7193	Can't this all be replaced by a call to getInterleaveGroupCost()?
7202–7203	Can't this all be replaced by a call to getMemInstScalarizationCost()?
7211–7261	Similar comment to the above. I don't think you have a helper for computing gather/scatter cost, but I think it would be nice. It would be easier to keep getInstructionCost in sync with setCostBasedWideningDecision.
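
The round-off concern in the 7069 comment above can be shown with plain integer arithmetic. The two functions below are hypothetical and only contrast the two ways of comparing a whole-group interleave cost with a per-access scalarization cost; overflow of the product is ignored in this sketch.

```cpp
#include <cassert>

// Per-access comparison using integer division: the remainder is lost.
bool interleaveCheaperByDivision(unsigned GroupCost, unsigned Accesses,
                                 unsigned ScalarCostPerAccess) {
  assert(Accesses > 0 && "a group has at least one member");
  return GroupCost / Accesses <= ScalarCostPerAccess;
}

// Whole-group comparison using multiplication: exact for integers.
bool interleaveCheaperByMultiplication(unsigned GroupCost, unsigned Accesses,
                                       unsigned ScalarCostPerAccess) {
  assert(Accesses > 0 && "a group has at least one member");
  return GroupCost <= ScalarCostPerAccess * Accesses;
}
```

For a group cost of 7, two accesses and a scalar cost of 3 per access, the division form computes 7 / 2 = 3 and reports the interleave group as no more expensive, while the multiplication form compares 7 against 6 and reports the opposite.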

delena marked an inline comment as done.Feb 6 2017, 4:44 AM

delena added inline comments.

../lib/Transforms/Vectorize/LoopVectorize.cpp
7088–7096	Agree. I've changed the comparison.
7180–7193	I reached the conclusion that we don't need to calculate the cost again. We can keep it together with widening decision.

The memory instruction cost that was calculated during widening decision is saved in WideningDecisions map in order to avoid recalculation.

Hi Elena,

I have one comment about getWideningCost; otherwise, the patch looks fine to me now. But please let Michael have one more pass at review since he's been quiet for a while. Thanks!

../lib/Transforms/Vectorize/LoopVectorize.cpp
2034–2035	This seems weird to me. All instructions are supposed to have a cost computed by the cost model. I would much rather us assert that I is in WideningDecisions. Isn't it true that if I is a load or store, we would have already computed a cost and saved it in WideningDecisions?
7059	Just assert inside getWideningCost() that "I" is present in the mapping and return the cost. The int to unsigned max conversion is unnecessary.
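
A rough sketch of what these two comments ask for: the cost is stored next to the decision, and the getter asserts that an entry exists instead of returning a sentinel. The struct name, the integer instruction ids and the integer decision code are assumptions for illustration; only the names `WideningDecisions` and `getWideningCost` come from the discussion.

```cpp
#include <cassert>
#include <map>
#include <utility>

struct CostModelSketch {
  // (instruction id, VF) -> (decision code, cost). Hypothetical layout.
  std::map<std::pair<int, unsigned>, std::pair<int, unsigned>> WideningDecisions;

  unsigned getWideningCost(int Inst, unsigned VF) const {
    assert(VF >= 2 && "Expected VF >=2");
    // A single find() serves both the presence check and the value read.
    auto It = WideningDecisions.find({Inst, VF});
    assert(It != WideningDecisions.end() &&
           "The cost is not calculated for this instruction");
    return It->second.second;
  }
};
```

Using `find()` once and asserting on its result also avoids the double lookup of a `count()` followed by `operator[]`.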

Updated according to Matthew's recent comments.
Matthew, thanks a lot for your review.

Michael, could you, please, take a look?

The high-level structure of this looks good to me, thanks Elena!

Some minor/style comments inline.

../lib/Transforms/Vectorize/LoopVectorize.cpp
329	We probably already have several clones of those functions around the code-base. And they are probably all slightly different. LSR has getAccessType(), LAA has getAddressSpaceOperand(), LoadStoreVectorizer has getPointerAddressSpace(), and I'm sure there are more. I don't want to make merging them a precondition of this patch, but can you please at least add a FIXME here?
1982	Why not just "return Uniforms[VF]->second.count(I);"? I don't think the verbosity helps here, and we don't actually care about the difference between find() and operator[] due to the assert. Or, to save a lookup in an asserts build, you could find() and then assert on the result of the find().
1985	Maybe rename NONE to UNKNOWN, then? But I'm fine with None if you think that's clearer. Also, I looked up the naming convention for enums in the coding standard, and I think it should be something like "CM_Widen, CM_Interleave", etc, not all caps.
1990	Same as above.
2023	Here, on the other hand, I think find() would be better - that way you don't need two lookups.
2078	I'm still not sure I understand why this gets called twice for a user-provided VF. Could you explain again?
5592	Do we still want this check here? I mean: An instruction with a uniform pointer can be widened. That ties in with the way this is used - we check for the uniform case before the widening case, as we should. So, IIUC, we should never actually hit this.
7047	UINT_MAX
7061	Why the NumAccesses * 2 cut-off?
7072	UINT_MAX
7085	Why do you need the "GatherScatterCost < InterleaveCost" check here?

delena marked 5 inline comments as done.Feb 7 2017, 12:48 AM

delena added inline comments.

../lib/Transforms/Vectorize/LoopVectorize.cpp
1982	The "const" qualifier does not allow the Uniforms[VF] form.
2078	We call this function from multiple places. From calculateRegisterUsage(), selectVectorizationFactor() - user-defined-VF and from expectedCost(). I just prevent recalculation.
7061	I consider a cost per instruction. In this case InterleaveCost / NumAccesses == 1. (Matthew asked to avoid divisions). About 1 inst per access is good enough. I added more comments.

Some code improvements, addressed Michael's comments.

mkuper added inline comments.Feb 7 2017, 9:27 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
1982	Ohh, right, missed it's const, sorry.
2078	I understand, I'm just wondering whether we really need to do that. Anyway, it's not a new problem, we don't have to solve it here.
5592	I think you may have missed this comment.
7061	Well, this isn't really "about 1", this is "below 2". I'd be more conservative here (InterleaveCost <= NumAccesses ? Can this even happen? Or are you trying to catch the cases where the ratio is ~1.1-1.2?). Or maybe remove this altogether. Is getGatherScatterCost() expensive in terms of compile time?
7086	So, in case the costs are equal, you prefer scalarization to interleaving, and interleaving to scatter/gather? (In theory it shouldn't matter what happens when the costs are equal, just making sure I understand this.)

mssimpso added inline comments.Feb 7 2017, 9:36 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7061	The TTI estimates are supposed to be cheap to compute. I think it makes sense to remove this altogether in favor of greater simplicity.

delena added inline comments.Feb 8 2017, 1:02 AM

../lib/Transforms/Vectorize/LoopVectorize.cpp
2078	In the current version, before my changes, we calculate Uniforms and Scalars once and do this in Legality. In this patch, I moved Uniforms and Scalars from Legality to the Cost Model and calculate them per VF. I did not find a single place to put the call, I'm calling the collectUniformsAndScalars() from multiple places. Doing that, I want to prevent the data recalculation. I suppose that finding a right single place for calling collectUniformsAndScalars() is possible, but it will require additional movements in selectVectorizationFactor(). I think it can be done in a separate patch.
5592	I'll fix.
7061	ok.
7086	Matthew asked to change: "I think we're fairly consistent in other places where we compare costs, that we prefer the scalar version if there's no benefit for vectorization. So if the scalarization cost is <= the other costs, I think we should go with that."

Some fixes following Michael's recent comments.

LGTM

../lib/Transforms/Vectorize/LoopVectorize.cpp
2078	Ah, ok, got it. Sure, sounds good as a follow-up.
7086	Ah, ok, missed that. Could you please add this as an explicit comment?

This revision is now accepted and ready to land.Feb 8 2017, 10:21 AM

Closed by commit rL294503: [Loop Vectorizer] Cost-based decision for vectorization form of memory… (authored by delena). Feb 8 2017, 11:37 AM

This revision was automatically updated to reflect the committed changes.

lebedev.ri mentioned this in D111460: [X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`.Oct 9 2021, 1:05 PM

Revision Contents

Path	Size
../lib/Transforms/Vectorize/LoopVectorize.cpp	405 lines
../test/Analysis/CostModel/X86/interleave-load-i32.ll	18 lines
../test/Analysis/CostModel/X86/interleave-store-i32.ll	18 lines
../test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll	38 lines
../test/Transforms/LoopVectorize/X86/consecutive-ptr-uniforms.ll	3 lines
../test/Transforms/LoopVectorize/X86/gather-vs-interleave.ll	41 lines

Diff 84811

../lib/Transforms/Vectorize/LoopVectorize.cpp

[... 320 lines not shown ...]
 static Value *getPointerOperand(Value *I) {
   if (auto *LI = dyn_cast<LoadInst>(I))
     return LI->getPointerOperand();
   if (auto *SI = dyn_cast<StoreInst>(I))
     return SI->getPointerOperand();
   return nullptr;
 }

 /// A helper function that returns true if the given type is irregular. The
[inline comment] mkuper: We probably already have several clones of those functions around the code-base. And they are probably all slightly different. LSR has getAccessType(), LAA has getAddressSpaceOperand(), LoadStoreVectorizer has getPointerAddressSpace(), and I'm sure there are more. I don't want to make merging them a precondition of this patch, but can you please at least add a FIXME here?
 /// type is irregular if its allocated size doesn't equal the store size of an
 /// element of the corresponding vector type at the given vectorization factor.
 static bool hasIrregularType(Type *Ty, const DataLayout &DL, unsigned VF) {

   // Determine if an array of VF elements of type Ty is "bitcast compatible"
   // with a <VF x Ty> vector.
   if (VF > 1) {
     auto *VectorTy = VectorType::get(Ty, VF);
[... 1,270 lines not shown; enclosing context: public: ...]
   bool isUniform(Value *V);

   /// Returns true if \p I is known to be uniform after vectorization.
   bool isUniformAfterVectorization(Instruction *I) { return Uniforms.count(I); }

   /// Returns true if \p I is known to be scalar after vectorization.
   bool isScalarAfterVectorization(Instruction *I) { return Scalars.count(I); }

+  /// Update the list of scalar and uniform values taking into
+  /// account the cost decisions.
+  void updateScalarsAndUniforms(unsigned VF);
+
   /// Returns the information that we collected about runtime memory check.
   const RuntimePointerChecking *getRuntimePointerChecking() const {
     return LAI->getRuntimePointerChecking();
   }

   const LoopAccessInfo *getLAI() const { return LAI; }

   /// \brief Check if \p Instr belongs to any interleaved access group.
▲ Show 20 Lines • Show All 70 Lines • ▼ Show 20 Lines	public:
/// that are treated like consecutive pointers during vectorization. The		/// that are treated like consecutive pointers during vectorization. The
/// pointer operands of interleaved accesses are an example.		/// pointer operands of interleaved accesses are an example.
bool hasConsecutiveLikePtrOperand(Instruction *I);		bool hasConsecutiveLikePtrOperand(Instruction *I);

/// Returns true if \p I is a memory instruction that must be scalarized		/// Returns true if \p I is a memory instruction that must be scalarized
/// during vectorization.		/// during vectorization.
bool memoryInstructionMustBeScalarized(Instruction *I, unsigned VF = 1);		bool memoryInstructionMustBeScalarized(Instruction *I, unsigned VF = 1);

		/// Returns true if \p I is a memory instruction with consecutive memory
		/// access that can be widened.
		bool memoryInstructionCanBeWidened(Instruction *I, unsigned VF = 1);

		/// Decision that was taken during cost calculation for memory instruction.
		enum InstWidening {
		mkuper: I'm not a fan of the name of this enum, but don't really have any good ideas. Anyone else?
		delena (author): CM_DECISION_NONE, CM_DECISION_WIDEN, ...?
		CM_DECISION_NONE,
		CM_DECISION_WIDEN,
		CM_DECISION_INTERLEAVE,
		CM_DECISION_GATHER_SCATTER,
		CM_DECISION_SCALARIZE
		};

		typedef DenseMap<std::pair<Instruction *, unsigned>, InstWidening> DecisionList;

		/// Save vectorization decision \p W taken by the cost model for
		/// instruction \p I and vector width \p VF.
		void setCostModelDecision(Instruction *I, unsigned VF, InstWidening W) {
		assert(VF >= 2 && "Expected VF >=2");
		WideningDecisions[std::make_pair(I, VF)] = W;
		}

		/// Save vectorization decision \p W taken by the cost model for
		/// interleaving group \p Grp and vector width \p VF.
		void setCostModelDecision(const InterleaveGroup *Grp, unsigned VF,
		InstWidening W) {
		assert(VF >= 2 && "Expected VF >=2");
		/// Broadcast this decision to all instructions inside the group.
		for (unsigned i = 0; i < Grp->getFactor(); ++i) {
		mkuper: This looks like a suboptimal way to iterate over InterleaveGroup, because it requires a lookup for every index, instead of just iterating over the Members collection. (We don't care about the order here, right?) Perhaps add a better interface?
		delena (author): I agree, but iterating up to Factor is not a big overhead, and it is used in at least one more place. This patch is complex enough; I'd postpone unrelated changes to another patch.
		if (auto *I = Grp->getMember(i))
		WideningDecisions[std::make_pair(I, VF)] = W;
		}
		}
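The two iteration styles discussed in the comments on this loop can be compared with a minimal sketch. Everything here is a hypothetical stand-in rather than the LLVM API: `Instr` replaces `llvm::Instruction`, `Group` is a simplified interleave group with a public `Members` map, `std::map` replaces `DenseMap`, and the enum is abbreviated.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical stand-in for llvm::Instruction; only its address is used.
struct Instr {};

// Abbreviated copy of the enum from the patch.
enum InstWidening { CM_DECISION_NONE, CM_DECISION_INTERLEAVE,
                    CM_DECISION_GATHER_SCATTER };

// Hypothetical, simplified interleave group: member index -> instruction.
// Indices may have gaps, so getMember() can return nullptr.
struct Group {
  unsigned Factor;
  std::map<unsigned, const Instr *> Members;

  unsigned getFactor() const { return Factor; }
  const Instr *getMember(unsigned Index) const {
    auto It = Members.find(Index);
    return It == Members.end() ? nullptr : It->second;
  }
};

using DecisionList =
    std::map<std::pair<const Instr *, unsigned>, InstWidening>;

// Style used in the patch: probe every index in [0, Factor).
void broadcastByIndex(DecisionList &D, const Group &G, unsigned VF,
                      InstWidening W) {
  assert(VF >= 2 && "Expected VF >= 2");
  for (unsigned i = 0; i < G.getFactor(); ++i)
    if (const Instr *I = G.getMember(i))
      D[{I, VF}] = W;
}

// Style suggested in review: walk the member collection directly.
void broadcastByMembers(DecisionList &D, const Group &G, unsigned VF,
                        InstWidening W) {
  assert(VF >= 2 && "Expected VF >= 2");
  for (const auto &KV : G.Members)
    D[{KV.second, VF}] = W;
}
```

Both functions record the same decisions; the second avoids one lookup per index, at the price of exposing the member collection, which is the interface change the author preferred to postpone.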

		mkuper: Is this really a per-instruction thing, or do we just always end up with LV_NONE when the cost model is not in use (e.g. explicit user-provided VF)? If the former, when does this happen? If the latter, then I think the comment is a bit misleading - and it would be good to be able to assert on this happening only when there's no cost model. Edit: Oh, I think I see, you care about temporary LV_NONE values for Interleave groups. I still think it'd be nice if we could somehow distinguish between the cases, so we could assert on not having a cost when we should.
		/// Return the cost model decision for the given instruction \p I and vector
		/// width \p VF. Return CM_DECISION_NONE if this instruction did not pass
		/// through the cost modeling.
		InstWidening getCostModelDecision(Instruction *I, unsigned VF) {
		assert(VF >= 2 && "Expected VF >=2");
		std::pair<Instruction *, unsigned> InstOnVF = std::make_pair(I, VF);
		if (!WideningDecisions.count(InstOnVF))
		return CM_DECISION_NONE;
		return WideningDecisions[InstOnVF];
		}
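A reviewer notes elsewhere in this diff that find() would avoid the two lookups (count() followed by operator[]) in this getter. A minimal sketch of that variant follows, assuming hypothetical stand-ins: `Instr` for `llvm::Instruction`, `std::map` for `DenseMap`, and a wrapper class `DecisionTable` that does not exist in the patch.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical stand-in for llvm::Instruction; only its address is a key.
struct Instr {};

// Enumerators as named in this revision of the patch.
enum InstWidening {
  CM_DECISION_NONE,
  CM_DECISION_WIDEN,
  CM_DECISION_INTERLEAVE,
  CM_DECISION_GATHER_SCATTER,
  CM_DECISION_SCALARIZE
};

// Hypothetical wrapper around the decision map (std::map replaces DenseMap).
class DecisionTable {
  std::map<std::pair<const Instr *, unsigned>, InstWidening> WideningDecisions;

public:
  void setCostModelDecision(const Instr *I, unsigned VF, InstWidening W) {
    assert(VF >= 2 && "Expected VF >= 2");
    WideningDecisions[std::make_pair(I, VF)] = W;
  }

  // A single find() serves both the "missing" and the "present" case, and
  // lets the getter be const because no entry is ever default-inserted.
  InstWidening getCostModelDecision(const Instr *I, unsigned VF) const {
    assert(VF >= 2 && "Expected VF >= 2");
    auto It = WideningDecisions.find(std::make_pair(I, VF));
    return It == WideningDecisions.end() ? CM_DECISION_NONE : It->second;
  }
};
```

A missing (instruction, VF) pair still reads as CM_DECISION_NONE, so callers that assert on "a decision was made" behave the same as with the two-lookup version.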

private:		private:
/// Check if a single basic block loop is vectorizable.		/// Check if a single basic block loop is vectorizable.
/// At this point we know that this is a loop with a constant trip count		/// At this point we know that this is a loop with a constant trip count
/// and we only need to check individual instructions.		/// and we only need to check individual instructions.
bool canVectorizeInstrs();		bool canVectorizeInstrs();

/// When we vectorize loops we may change the order in which		/// When we vectorize loops we may change the order in which
/// we read and write from memory. This method checks if it is		/// we read and write from memory. This method checks if it is
/// legal to vectorize the code, considering only memory constraints.		/// legal to vectorize the code, considering only memory constraints.
/// Returns true if the loop is vectorizable		/// Returns true if the loop is vectorizable
bool canVectorizeMemory();		bool canVectorizeMemory();

/// Return true if we can vectorize this loop using the IF-conversion		/// Return true if we can vectorize this loop using the IF-conversion
/// transformation.		/// transformation.
bool canVectorizeWithIfConvert();		bool canVectorizeWithIfConvert();

/// Collect the instructions that are uniform after vectorization. An		/// Collect the instructions that are uniform after vectorization. An
/// instruction is uniform if we represent it with a single scalar value in		/// instruction is uniform if we represent it with a single scalar value in
/// the vectorized loop corresponding to each vector iteration. Examples of		/// the vectorized loop corresponding to each vector iteration. Examples of
/// uniform instructions include pointer operands of consecutive or		/// uniform instructions include pointer operands of consecutive or
/// interleaved memory accesses. Note that although uniformity implies an		/// interleaved memory accesses. Note that although uniformity implies an
/// instruction will be scalar, the reverse is not true. In general, a		/// instruction will be scalar, the reverse is not true. In general, a
/// scalarized instruction will be represented by VF scalar values in the		/// scalarized instruction will be represented by VF scalar values in the
/// vectorized loop, each corresponding to an iteration of the original		/// vectorized loop, each corresponding to an iteration of the original
/// scalar loop.		/// scalar loop. When the VF is known (\p KnownVF > 2), we take cost model
void collectLoopUniforms();		/// decisions into consideration.
		void collectLoopUniforms(unsigned KnownVF = 0);

/// Collect the instructions that are scalar after vectorization. An		/// Collect the instructions that are scalar after vectorization. An
/// instruction is scalar if it is known to be uniform or will be scalarized		/// instruction is scalar if it is known to be uniform or will be scalarized
/// during vectorization. Non-uniform scalarized instructions will be		/// during vectorization. Non-uniform scalarized instructions will be
/// represented by VF values in the vectorized loop, each corresponding to an		/// represented by VF values in the vectorized loop, each corresponding to an
/// iteration of the original scalar loop.		/// iteration of the original scalar loop. When the VF is known
void collectLoopScalars();		/// (\p KnownVF > 2), we take cost model decisions into consideration.
		void collectLoopScalars(unsigned KnownVF = 0);

/// Return true if all of the instructions in the block can be speculatively		/// Return true if all of the instructions in the block can be speculatively
/// executed. \p SafePtrs is a list of addresses that are known to be legal		/// executed. \p SafePtrs is a list of addresses that are known to be legal
/// and we know that we can read from them without segfault.		/// and we know that we can read from them without segfault.
bool blockCanBePredicated(BasicBlock *BB, SmallPtrSetImpl<Value *> &SafePtrs);		bool blockCanBePredicated(BasicBlock *BB, SmallPtrSetImpl<Value *> &SafePtrs);

/// Updates the vectorization state by adding \p Phi to the inductions list.		/// Updates the vectorization state by adding \p Phi to the inductions list.
/// This can set \p Phi as the main induction of the loop if \p Phi is a		/// This can set \p Phi as the main induction of the loop if \p Phi is a
[92 lines not shown]	private:
LoopVectorizationRequirements *Requirements;		LoopVectorizationRequirements *Requirements;

/// Used to emit an analysis of any legality issues.		/// Used to emit an analysis of any legality issues.
LoopVectorizeHints *Hints;		LoopVectorizeHints *Hints;

/// While vectorizing these instructions we have to generate a		/// While vectorizing these instructions we have to generate a
/// call to the appropriate masked intrinsic		/// call to the appropriate masked intrinsic
SmallPtrSet<const Instruction *, 8> MaskedOp;		SmallPtrSet<const Instruction *, 8> MaskedOp;

		/// Keeps cost model decisions for instructions.
		/// Right now it is used for memory instructions only.
		DecisionList WideningDecisions;
};		};

/// LoopVectorizationCostModel - estimates the expected speedups due to		/// LoopVectorizationCostModel - estimates the expected speedups due to
/// vectorization.		/// vectorization.
/// In many cases vectorization is not profitable. This can happen because of		/// In many cases vectorization is not profitable. This can happen because of
/// a number of reasons. In this class we mainly attempt to predict the		/// a number of reasons. In this class we mainly attempt to predict the
/// expected speedup/slowdowns due to the supported instruction set. We use the		/// expected speedup/slowdowns due to the supported instruction set. We use the
/// TargetTransformInfo to query the different backends for the cost of		/// TargetTransformInfo to query the different backends for the cost of
[68 lines not shown]	public:
}		}

/// \returns True if instruction \p I can be truncated to a smaller bitwidth		/// \returns True if instruction \p I can be truncated to a smaller bitwidth
/// for vectorization factor \p VF.		/// for vectorization factor \p VF.
bool canTruncateToMinimalBitwidth(Instruction *I, unsigned VF) const {		bool canTruncateToMinimalBitwidth(Instruction *I, unsigned VF) const {
return VF > 1 && MinBWs.count(I) && !isProfitableToScalarize(I, VF) &&		return VF > 1 && MinBWs.count(I) && !isProfitableToScalarize(I, VF) &&
!Legal->isScalarAfterVectorization(I);		!Legal->isScalarAfterVectorization(I);
}		}

		mkuper: Why not just "return Uniforms[VF]->second.count(I);"? I don't think the verbosity helps here, and we don't actually care about the difference between find() and operator[] due to the assert. Or, to save a lookup in an asserts build, you could find() and then assert on the result of the find().
		delena (author): The "const" qualifier does not allow the Uniforms[VF] form.
		mkuper: Ohh, right, missed it's const, sorry.
private:		private:
/// The vectorization cost is a combination of the cost itself and a boolean		/// The vectorization cost is a combination of the cost itself and a boolean
/// indicating whether any of the contributing operations will actually		/// indicating whether any of the contributing operations will actually
		mssimpso: How are we using CM_DECISION_NONE? Aren't we forced to make some sort of decision? I would think the default would be CM_DECISION_WIDEN unless the cost model points to one of the others.
		delena (author): I used it while capturing decisions for an interleave group. At this point I can get rid of it. I return "NONE" if there is no decision, which is convenient:
		  InstWidening getWideningDecision(Instruction *I, unsigned VF) {
		    assert(VF >= 2 && "Expected VF >=2");
		    std::pair<Instruction *, unsigned> InstOnVF = std::make_pair(I, VF);
		    if (!WideningDecisions.count(InstOnVF))
		      return CM_DECISION_NONE;
		    return WideningDecisions[InstOnVF];
		  }
		Then I use it in an assertion, to make sure that a decision is made.
		mkuper: Maybe rename NONE to UNKNOWN, then? But I'm fine with None if you think that's clearer. Also, I looked up the naming convention for enums in the coding standard, and I think it should be something like "CM_Widen, CM_Interleave", etc, not all caps.
/// operate on		/// operate on
/// vector values after type legalization in the backend. If this latter value		/// vector values after type legalization in the backend. If this latter value
/// is		/// is
/// false, then all operations will be scalarized (i.e. no vectorization has		/// false, then all operations will be scalarized (i.e. no vectorization has
/// actually taken place).		/// actually taken place).
		mkuper: Same as above.
typedef std::pair<unsigned, bool> VectorizationCostTy;		typedef std::pair<unsigned, bool> VectorizationCostTy;

/// Returns the expected execution cost. The unit of the cost does		/// Returns the expected execution cost. The unit of the cost does
/// not matter because we use the 'cost' units to compare different		/// not matter because we use the 'cost' units to compare different
/// vector widths. The cost that is returned is not normalized by		/// vector widths. The cost that is returned is not normalized by
/// the factor width.		/// the factor width.
VectorizationCostTy expectedCost(unsigned VF);		VectorizationCostTy expectedCost(unsigned VF);

/// Returns the execution time cost of an instruction for a given vector		/// Returns the execution time cost of an instruction for a given vector
/// width. Vector width of one means scalar.		/// width. Vector width of one means scalar.
VectorizationCostTy getInstructionCost(Instruction *I, unsigned VF);		VectorizationCostTy getInstructionCost(Instruction *I, unsigned VF);

/// The cost-computation logic from getInstructionCost which provides		/// The cost-computation logic from getInstructionCost which provides
/// the vector type as an output parameter.		/// the vector type as an output parameter.
unsigned getInstructionCost(Instruction *I, unsigned VF, Type *&VectorTy);		unsigned getInstructionCost(Instruction *I, unsigned VF, Type *&VectorTy);

		/// The cost computation for scalarized memory instruction.
		unsigned getMemInstScalarizationCost(Instruction *I, unsigned VF);

		/// The cost computation for interleaving group of memory instructions.
		unsigned getInterleaveGroupCost(Instruction *I, unsigned VF);
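The summary and the first review comments point toward comparing costs instead of checking forms in a fixed order. A minimal sketch of such a comparison is shown here, assuming the cost model simply picks the cheapest legal form; `Candidate` and `pickCheapest` are hypothetical names, and the numeric costs used with them are invented for illustration, not taken from any target.

```cpp
#include <vector>

// Enumerators as named in this revision of the patch.
enum InstWidening {
  CM_DECISION_NONE,
  CM_DECISION_WIDEN,
  CM_DECISION_INTERLEAVE,
  CM_DECISION_GATHER_SCATTER,
  CM_DECISION_SCALARIZE
};

// Hypothetical record: one way to vectorize a memory access, whether that
// way is legal for the instruction, and its estimated cost.
struct Candidate {
  InstWidening Form;
  bool IsLegal;
  unsigned Cost;
};

// Return the cheapest legal form, or CM_DECISION_NONE if none is legal.
// On equal cost the candidate listed first wins (strict comparison).
InstWidening pickCheapest(const std::vector<Candidate> &Candidates) {
  InstWidening Best = CM_DECISION_NONE;
  unsigned BestCost = 0;
  for (const Candidate &C : Candidates) {
    if (!C.IsLegal)
      continue;
    if (Best == CM_DECISION_NONE || C.Cost < BestCost) {
      Best = C.Form;
      BestCost = C.Cost;
    }
  }
  return Best;
}
```

With made-up costs of 40 for an interleave group, 12 for a gather and 20 for scalarization, the gather is selected even though the access belongs to an interleave group, which is the behavior the summary asks for.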

/// Returns whether the instruction is a load or store and will be emitted		/// Returns whether the instruction is a load or store and will be emitted
/// as a vector operation.		/// as a vector operation.
bool isConsecutiveLoadOrStore(Instruction *I);		bool isConsecutiveLoadOrStore(Instruction *I);

/// Create an analysis remark that explains why vectorization failed		/// Create an analysis remark that explains why vectorization failed
///		///
/// \p RemarkName is the identifier for the remark. \return the remark object		/// \p RemarkName is the identifier for the remark. \return the remark object
/// that can be streamed to.		/// that can be streamed to.
OptimizationRemarkAnalysis createMissedAnalysis(StringRef RemarkName) {		OptimizationRemarkAnalysis createMissedAnalysis(StringRef RemarkName) {
return ::createMissedAnalysis(Hints->vectorizeAnalysisPassName(),		return ::createMissedAnalysis(Hints->vectorizeAnalysisPassName(),
RemarkName, TheLoop);		RemarkName, TheLoop);
		mkuper (marked done): Here, on the other hand, I think find() would be better - that way you don't need two lookups.
}		}

/// Map of scalar integer values to the smallest bitwidth they can be legally		/// Map of scalar integer values to the smallest bitwidth they can be legally
/// represented as. The vector equivalents of these values should be truncated		/// represented as. The vector equivalents of these values should be truncated
/// to this type.		/// to this type.
MapVector<Instruction *, uint64_t> MinBWs;		MapVector<Instruction *, uint64_t> MinBWs;

/// A type representing the costs for instructions if they were to be		/// A type representing the costs for instructions if they were to be
/// scalarized rather than vectorized. The entries are Instruction-Cost		/// scalarized rather than vectorized. The entries are Instruction-Cost
/// pairs.		/// pairs.
typedef DenseMap<Instruction *, unsigned> ScalarCostsTy;		typedef DenseMap<Instruction *, unsigned> ScalarCostsTy;

		mssimpso: This seems weird to me. All instructions are supposed to have a cost computed by the cost model. I would much rather we assert that I is in WideningDecisions. Isn't it true that if I is a load or store, we would have already computed a cost and saved it in WideningDecisions?
/// A map holding scalar costs for different vectorization factors. The		/// A map holding scalar costs for different vectorization factors. The
/// presence of a cost for an instruction in the mapping indicates that the		/// presence of a cost for an instruction in the mapping indicates that the
/// instruction will be scalarized when vectorizing with the associated		/// instruction will be scalarized when vectorizing with the associated
/// vectorization factor. The entries are VF-ScalarCostTy pairs.		/// vectorization factor. The entries are VF-ScalarCostTy pairs.
DenseMap<unsigned, ScalarCostsTy> InstsToScalarize;		DenseMap<unsigned, ScalarCostsTy> InstsToScalarize;

/// Returns the expected difference in cost from scalarizing the expression		/// Returns the expected difference in cost from scalarizing the expression
/// feeding a predicated instruction \p PredInst. The instructions to		/// feeding a predicated instruction \p PredInst. The instructions to
[10 lines not shown]
public:		public:
/// The loop that we evaluate.		/// The loop that we evaluate.
Loop *TheLoop;		Loop *TheLoop;
/// Predicated scalar evolution analysis.		/// Predicated scalar evolution analysis.
PredicatedScalarEvolution &PSE;		PredicatedScalarEvolution &PSE;
/// Loop Info analysis.		/// Loop Info analysis.
LoopInfo *LI;		LoopInfo *LI;
/// Vectorization legality.		/// Vectorization legality.
LoopVectorizationLegality *Legal;		LoopVectorizationLegality *Legal;
/// Vector target information.		/// Vector target information.
		mssimpso (marked done): The VF > 2 comment can be removed now.
const TargetTransformInfo &TTI;		const TargetTransformInfo &TTI;
/// Target Library Info.		/// Target Library Info.
const TargetLibraryInfo *TLI;		const TargetLibraryInfo *TLI;
/// Demanded bits analysis.		/// Demanded bits analysis.
DemandedBits *DB;		DemandedBits *DB;
/// Assumption cache.		/// Assumption cache.
AssumptionCache *AC;		AssumptionCache *AC;
/// Interface to emit optimization remarks.		/// Interface to emit optimization remarks.
		mssimpso (marked done): The VF > 2 comment can be removed now.
OptimizationRemarkEmitter *ORE;		OptimizationRemarkEmitter *ORE;

const Function *TheFunction;		const Function *TheFunction;
/// Loop Vectorize Hint.		/// Loop Vectorize Hint.
const LoopVectorizeHints *Hints;		const LoopVectorizeHints *Hints;
/// Values to ignore in the cost model.		/// Values to ignore in the cost model.
SmallPtrSet<const Value *, 16> ValuesToIgnore;		SmallPtrSet<const Value *, 16> ValuesToIgnore;
		mkuper: I'm still not sure I understand why this gets called twice for a user-provided VF. Could you explain again?
		delena (author): We call this function from multiple places: from calculateRegisterUsage(), from selectVectorizationFactor() (user-defined VF), and from expectedCost(). I just prevent recalculation.
		mkuper: I understand, I'm just wondering whether we really need to do that. Anyway, it's not a new problem, we don't have to solve it here.
		delena (author): In the current version, before my changes, we calculate Uniforms and Scalars once and do this in Legality. In this patch, I moved Uniforms and Scalars from Legality to the Cost Model and calculate them per VF. I did not find a single place to put the call, so I'm calling collectUniformsAndScalars() from multiple places. Doing that, I want to prevent recalculating the data. I suppose that finding the right single place for calling collectUniformsAndScalars() is possible, but it will require additional changes in selectVectorizationFactor(). I think it can be done in a separate patch.
		mkuper: Ah, ok, got it. Sure, sounds good as a follow-up.
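The recalculation guard described in this thread can be pictured as a per-VF cache. This is only a sketch under assumptions: `PerVFAnalysis` and the counter are hypothetical, integer IDs stand in for instructions, and the "analysis" is a placeholder rather than the real uniform/scalar collection.

```cpp
#include <map>
#include <set>

// Hypothetical holder for results that are computed once per VF.
class PerVFAnalysis {
  // VF -> set of "uniform" values (plain ints stand in for instructions).
  std::map<unsigned, std::set<int>> Uniforms;
  unsigned NumComputations = 0;

public:
  // Safe to call from several places: the work runs once per distinct VF.
  void collectUniformsAndScalars(unsigned VF) {
    if (Uniforms.count(VF))
      return; // already analyzed for this VF, nothing to recompute
    ++NumComputations;
    Uniforms[VF] = {0, 1}; // placeholder for the real per-VF analysis
  }

  bool isAnalyzed(unsigned VF) const { return Uniforms.count(VF) != 0; }
  unsigned numComputations() const { return NumComputations; }
};
```

Callers such as the three functions named above could all invoke the collector unconditionally; finding a single call site, as proposed for a follow-up patch, would make the guard unnecessary.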
/// Values to ignore in the cost model when VF > 1.		/// Values to ignore in the cost model when VF > 1.
SmallPtrSet<const Value *, 16> VecValuesToIgnore;		SmallPtrSet<const Value *, 16> VecValuesToIgnore;
};		};

/// \brief This holds vectorization requirements that must be verified late in		/// \brief This holds vectorization requirements that must be verified late in
/// the process. The requirements are set by legalize and costmodel. Once		/// the process. The requirements are set by legalize and costmodel. Once
/// vectorization has been determined to be possible and profitable the		/// vectorization has been determined to be possible and profitable the
/// requirements can be verified by looking for metadata or compiler options.		/// requirements can be verified by looking for metadata or compiler options.
[785 lines not shown]

void InnerLoopVectorizer::vectorizeMemoryInstruction(Instruction *Instr) {		void InnerLoopVectorizer::vectorizeMemoryInstruction(Instruction *Instr) {
// Attempt to issue a wide load.		// Attempt to issue a wide load.
LoadInst *LI = dyn_cast<LoadInst>(Instr);		LoadInst *LI = dyn_cast<LoadInst>(Instr);
StoreInst *SI = dyn_cast<StoreInst>(Instr);		StoreInst *SI = dyn_cast<StoreInst>(Instr);

assert((LI || SI) && "Invalid Load/Store instruction");		assert((LI || SI) && "Invalid Load/Store instruction");

		// The vectorization decision taken during the cost computation.
		LoopVectorizationLegality::InstWidening Decision =
		Legal->getCostModelDecision(Instr, VF);
		bool NoCostModelDecision = (Decision ==
		LoopVectorizationLegality::CM_DECISION_NONE);
		mkuper (marked done): Extra space after =.

// Try to vectorize the interleave group if this access is interleaved.		// Try to vectorize the interleave group if this access is interleaved.
if (Legal->isAccessInterleaved(Instr))		if (Decision == LoopVectorizationLegality::CM_DECISION_INTERLEAVE ||
		(NoCostModelDecision && Legal->isAccessInterleaved(Instr)))
return vectorizeInterleaveGroup(Instr);		return vectorizeInterleaveGroup(Instr);

		// Scalarize the memory instruction if necessary.
		if (Decision == LoopVectorizationLegality::CM_DECISION_SCALARIZE ||
		(NoCostModelDecision &&
		Legal->memoryInstructionMustBeScalarized(Instr, VF)))
		return scalarizeInstruction(Instr, Legal->isScalarWithPredication(Instr));

Type *ScalarDataTy = LI ? LI->getType() : SI->getValueOperand()->getType();		Type *ScalarDataTy = LI ? LI->getType() : SI->getValueOperand()->getType();
		mkuper: assert(!Legal->memoryInstructionMustBeScalarized(Instr, VF)), maybe? Because now we leave the door open for the cost model decision being to widen, even though memoryInstructionMustBeScalarized(). Another option is to replace the if with: if (Decision == LoopVectorizationLegality::LV_SCALARIZE || Legal->memoryInstructionMustBeScalarized(Instr, VF)). But I'm not sure it makes sense, because there's no good reason to always call memoryInstructionMustBeScalarized() in a non-asserts build - in theory, the cost model should have already checked this.
		delena (author): The form I used just prevents a redundant call to memoryInstructionMustBeScalarized(), because we may already have a decision (INTERLEAVE, WIDEN or GATHER_SCATTER) at this stage.
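The dispatch order in this function (a recorded decision first, the legality queries only when no decision exists) can be summarized in a small sketch. `LegalityFacts`, `Lowering` and `chooseLowering` are hypothetical names that bundle the queries made above; treating everything else as a plain wide access is a simplification of the consecutive-pointer handling that follows in the real code.

```cpp
// Enumerators as named in this revision of the patch.
enum InstWidening {
  CM_DECISION_NONE,
  CM_DECISION_WIDEN,
  CM_DECISION_INTERLEAVE,
  CM_DECISION_GATHER_SCATTER,
  CM_DECISION_SCALARIZE
};

// Hypothetical summary of how the access ends up being emitted.
enum class Lowering { InterleaveGroup, Scalarize, GatherScatter, Widen };

// Hypothetical bundle of the legality queries used as a fallback.
struct LegalityFacts {
  bool IsInterleaved;
  bool MustBeScalarized;
  bool IsConsecutive;
  bool IsLegalGatherOrScatter;
};

Lowering chooseLowering(InstWidening Decision, const LegalityFacts &L) {
  const bool NoDecision = Decision == CM_DECISION_NONE;
  // A recorded decision wins; legality is consulted only without one.
  if (Decision == CM_DECISION_INTERLEAVE || (NoDecision && L.IsInterleaved))
    return Lowering::InterleaveGroup;
  if (Decision == CM_DECISION_SCALARIZE || (NoDecision && L.MustBeScalarized))
    return Lowering::Scalarize;
  if (Decision == CM_DECISION_GATHER_SCATTER ||
      (NoDecision && !L.IsConsecutive && L.IsLegalGatherOrScatter))
    return Lowering::GatherScatter;
  return Lowering::Widen; // simplification: everything else is a wide access
}
```

For an access that is a member of an interleave group, a recorded CM_DECISION_GATHER_SCATTER yields a gather, while CM_DECISION_NONE falls back to the interleave group, matching the short-circuit form the author describes.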
Type *DataTy = VectorType::get(ScalarDataTy, VF);		Type *DataTy = VectorType::get(ScalarDataTy, VF);
Value *Ptr = getPointerOperand(Instr);		Value *Ptr = getPointerOperand(Instr);
unsigned Alignment = LI ? LI->getAlignment() : SI->getAlignment();		unsigned Alignment = LI ? LI->getAlignment() : SI->getAlignment();
// An alignment of 0 means target abi alignment. We need to use the scalar's		// An alignment of 0 means target abi alignment. We need to use the scalar's
// target abi alignment in such a case.		// target abi alignment in such a case.
const DataLayout &DL = Instr->getModule()->getDataLayout();		const DataLayout &DL = Instr->getModule()->getDataLayout();
if (!Alignment)		if (!Alignment)
Alignment = DL.getABITypeAlignment(ScalarDataTy);		Alignment = DL.getABITypeAlignment(ScalarDataTy);
unsigned AddressSpace = Ptr->getType()->getPointerAddressSpace();		unsigned AddressSpace = Ptr->getType()->getPointerAddressSpace();

// Scalarize the memory instruction if necessary.
if (Legal->memoryInstructionMustBeScalarized(Instr, VF))
return scalarizeInstruction(Instr, Legal->isScalarWithPredication(Instr));

// Determine if the pointer operand of the access is either consecutive or		// Determine if the pointer operand of the access is either consecutive or
// reverse consecutive.		// reverse consecutive.
int ConsecutiveStride = Legal->isConsecutivePtr(Ptr);		int ConsecutiveStride = Legal->isConsecutivePtr(Ptr);
bool Reverse = ConsecutiveStride < 0;		bool Reverse = ConsecutiveStride < 0;

// Determine if either a gather or scatter operation is legal.
bool CreateGatherScatter =		bool CreateGatherScatter =
!ConsecutiveStride && Legal->isLegalGatherOrScatter(Instr);		(Decision == LoopVectorizationLegality::CM_DECISION_GATHER_SCATTER ||
		(NoCostModelDecision && !ConsecutiveStride &&
		Legal->isLegalGatherOrScatter(Instr)));

VectorParts VectorGep;		VectorParts VectorGep;

// Handle consecutive loads/stores.		// Handle consecutive loads/stores.
GetElementPtrInst *Gep = getGEPInstruction(Ptr);		GetElementPtrInst *Gep = getGEPInstruction(Ptr);
if (ConsecutiveStride) {		if (ConsecutiveStride) {
if (Gep) {		if (Gep) {
unsigned NumOperands = Gep->getNumOperands();		unsigned NumOperands = Gep->getNumOperands();
[1,018 lines not shown]	void InnerLoopVectorizer::vectorizeLoop() {
// are vectorized, so we can use them to construct the PHI.		// are vectorized, so we can use them to construct the PHI.
PhiVector PHIsToFix;		PhiVector PHIsToFix;

// Collect instructions from the original loop that will become trivially		// Collect instructions from the original loop that will become trivially
// dead in the vectorized loop. We don't need to vectorize these		// dead in the vectorized loop. We don't need to vectorize these
// instructions.		// instructions.
collectTriviallyDeadInstructions();		collectTriviallyDeadInstructions();

		// Now, when the VF is finalized, the list of uniform and scalar values
		// should be updated in order to take the cost model decisions into
		// consideration.
		Legal->updateScalarsAndUniforms(VF);

// Scan the loop in a topological order to ensure that defs are vectorized		// Scan the loop in a topological order to ensure that defs are vectorized
// before users.		// before users.
LoopBlocksDFS DFS(OrigLoop);		LoopBlocksDFS DFS(OrigLoop);
DFS.perform(LI);		DFS.perform(LI);

// Vectorize all of the blocks in the original loop.		// Vectorize all of the blocks in the original loop.
for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO()))		for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO()))
vectorizeBlockInLoop(BB, &PHIsToFix);		vectorizeBlockInLoop(BB, &PHIsToFix);
[1,523 lines not shown]	bool LoopVectorizationLegality::canVectorizeInstrs() {
// is the same size. If it's not, unset it here and InnerLoopVectorizer		// is the same size. If it's not, unset it here and InnerLoopVectorizer
// will create another.		// will create another.
if (Induction && WidestIndTy != Induction->getType())		if (Induction && WidestIndTy != Induction->getType())
Induction = nullptr;		Induction = nullptr;

return true;		return true;
}		}

void LoopVectorizationLegality::collectLoopScalars() {		void LoopVectorizationLegality::collectLoopScalars(unsigned KnownVF) {

// If an instruction is uniform after vectorization, it will remain scalar.		// If an instruction is uniform after vectorization, it will remain scalar.
Scalars.insert(Uniforms.begin(), Uniforms.end());		Scalars.insert(Uniforms.begin(), Uniforms.end());
		mssimpso: Since you've added a new check in collectUniformsAndScalars to ensure the analysis is performed only once, the check here and in collectLoopUniforms is redundant. Should these be asserts now?
		delena (author): Maybe, until somebody decides to call these functions separately. Right now I don't see a reason to. I'll put an "assert" and add a comment.

// Collect the getelementptr instructions that will not be vectorized. A		// Collect the getelementptr instructions that will not be vectorized. A
// getelementptr instruction is only vectorized if it is used for a legal		// getelementptr instruction is only vectorized if it is used for a legal
// gather or scatter operation.		// gather or scatter operation.
for (auto *BB : TheLoop->blocks())		for (auto *BB : TheLoop->blocks())
for (auto &I : *BB) {		for (auto &I : *BB) {
if (auto *GEP = dyn_cast<GetElementPtrInst>(&I)) {		if (auto *GEP = dyn_cast<GetElementPtrInst>(&I)) {
Scalars.insert(GEP);		Scalars.insert(GEP);
continue;		continue;
}		}
auto *Ptr = getPointerOperand(&I);		auto *Ptr = getPointerOperand(&I);
if (!Ptr)		if (!Ptr)
continue;		continue;
auto *GEP = getGEPInstruction(Ptr);		auto *GEP = getGEPInstruction(Ptr);
if (GEP && isLegalGatherOrScatter(&I))		if (GEP) {
		InstWidening Decision = KnownVF >= 2 ?
		getCostModelDecision(&I, KnownVF) : CM_DECISION_NONE;
		// Cost model decision has a higher priority.
		// Otherwise, the first priority is interleave, then gather/scatter.
		if (Decision == CM_DECISION_GATHER_SCATTER ||
		(Decision == CM_DECISION_NONE && !isAccessInterleaved(&I) &&
		isLegalGatherOrScatter(&I))) {
Scalars.erase(GEP);		Scalars.erase(GEP);
		DEBUG(dbgs() << "Instruction " << *GEP << " has been removed from scalars.\n");
		}
		}
}		}

// An induction variable will remain scalar if all users of the induction		// An induction variable will remain scalar if all users of the induction
// variable and induction variable update remain scalar.		// variable and induction variable update remain scalar.
auto *Latch = TheLoop->getLoopLatch();		auto *Latch = TheLoop->getLoopLatch();
for (auto &Induction : *getInductionVars()) {		for (auto &Induction : *getInductionVars()) {
auto *Ind = Induction.first;		auto *Ind = Induction.first;
auto *IndUpdate = cast<Instruction>(Ind->getIncomingValueForBlock(Latch));		auto *IndUpdate = cast<Instruction>(Ind->getIncomingValueForBlock(Latch));
Show All 18 Lines	for (auto &Induction : *getInductionVars()) {

// The induction variable and its update instruction will remain scalar.		// The induction variable and its update instruction will remain scalar.
Scalars.insert(Ind);		Scalars.insert(Ind);
Scalars.insert(IndUpdate);		Scalars.insert(IndUpdate);
}		}
}		}

bool LoopVectorizationLegality::hasConsecutiveLikePtrOperand(Instruction *I) {		bool LoopVectorizationLegality::hasConsecutiveLikePtrOperand(Instruction *I) {
if (isAccessInterleaved(I))		if (isAccessInterleaved(I))
return true;		return true;
if (auto *Ptr = getPointerOperand(I))		if (auto *Ptr = getPointerOperand(I))
return isConsecutivePtr(Ptr);		return isConsecutivePtr(Ptr);
return false;		return false;
}		}
		mssimpso: I don't think I see a real use for this function anymore. Please see my related comment about memoryAccessMustBeScalarized. This function was only ever used in collectLoopUniforms to help determine how a memory access would be vectorized. I think you can probably greatly simplify the logic in collectLoopUniforms and remove it. Now, we know what the vectorizer will do based on the cost model decision. For a GEP to remain uniform, I think we just need to know that all its users are CM_DECISION_INTERLEAVE or CM_DECISION_WIDEN. Is this right? If so, please work it into the GEP part of collectLoopUniforms.
		delena (Author): I removed mustBeScalarized(). (I suppose you meant this function; the lines are mixed up.)
		mssimpso: In this comment I was talking about hasConsecutiveLikePtrOperand. I don't think it's needed anymore and it can probably be deleted. It's only used in collectLoopUniforms to help determine if a GEP will remain uniform. But can't we now determine that much more easily by just checking the CM_DECISION of the accesses? Does that make sense?

bool LoopVectorizationLegality::isScalarWithPredication(Instruction *I) {		bool LoopVectorizationLegality::isScalarWithPredication(Instruction *I) {
if (!blockNeedsPredication(I->getParent()))		if (!blockNeedsPredication(I->getParent()))
return false;		return false;
switch(I->getOpcode()) {		switch(I->getOpcode()) {
default:		default:
break;		break;
case Instruction::Store:		case Instruction::Store:
return !isMaskRequired(I);		return !isMaskRequired(I);
case Instruction::UDiv:		case Instruction::UDiv:
case Instruction::SDiv:		case Instruction::SDiv:
case Instruction::SRem:		case Instruction::SRem:
case Instruction::URem:		case Instruction::URem:
return mayDivideByZero(*I);		return mayDivideByZero(*I);
}		}
return false;		return false;
}		}

bool LoopVectorizationLegality::memoryInstructionMustBeScalarized(		bool LoopVectorizationLegality::memoryInstructionCanBeWidened(Instruction *I,
Instruction *I, unsigned VF) {		unsigned VF) {

// If the memory instruction is in an interleaved group, it will be
// vectorized and its pointer will remain uniform.
if (isAccessInterleaved(I))
return false;

// Get and ensure we have a valid memory instruction.		// Get and ensure we have a valid memory instruction.
LoadInst *LI = dyn_cast<LoadInst>(I);		LoadInst *LI = dyn_cast<LoadInst>(I);
StoreInst *SI = dyn_cast<StoreInst>(I);		StoreInst *SI = dyn_cast<StoreInst>(I);
assert((LI || SI) && "Invalid memory instruction");		assert((LI || SI) && "Invalid memory instruction");

// If the pointer operand is uniform (loop invariant), the memory instruction		// If the pointer operand is uniform (loop invariant), the memory instruction
// will be scalarized.		// will be scalarized.
auto *Ptr = getPointerOperand(I);		auto *Ptr = getPointerOperand(I);
if (LI && isUniform(Ptr))		if (LI && isUniform(Ptr))
		mkuper: Do we still want this check here? I mean: 1) An instruction with a uniform pointer can be widened. 2) That ties in with the way this is used: we check for the uniform case before the widening case, as we should. So, IIUC, we should never actually hit this.
		mkuper: I think you may have missed this comment.
		delena (Author): I'll fix.
return true;		return false;

// If the pointer operand is non-consecutive and neither a gather nor a		if (!isConsecutivePtr(Ptr))
// scatter operation is legal, the memory instruction will be scalarized.		return false;
if (!isConsecutivePtr(Ptr) && !isLegalGatherOrScatter(I))
return true;

// If the instruction is a store located in a predicated block, it will be		// If the instruction is a store located in a predicated block, it will be
// scalarized.		// scalarized.
if (isScalarWithPredication(I))		if (isScalarWithPredication(I))
return true;		return false;

// If the instruction's allocated size doesn't equal it's type size, it		// If the instruction's allocated size doesn't equal it's type size, it
// requires padding and will be scalarized.		// requires padding and will be scalarized.
auto &DL = I->getModule()->getDataLayout();		auto &DL = I->getModule()->getDataLayout();
auto *ScalarTy = LI ? LI->getType() : SI->getValueOperand()->getType();		auto *ScalarTy = LI ? LI->getType() : SI->getValueOperand()->getType();
if (hasIrregularType(ScalarTy, DL, VF))		if (hasIrregularType(ScalarTy, DL, VF))
		return false;

return true;		return true;
		}

		bool LoopVectorizationLegality::memoryInstructionMustBeScalarized(
		Instruction *I, unsigned VF) {

		mssimpso: Can we change this to something like:
		    if (Uniforms.count(VF))
		      return;
		    auto &UniformsVF = Uniforms[VF];
		We want to distinguish the case that (1) Uniforms have not been computed for VF from (2) Uniforms have been computed for VF but there aren't any, so we don't need to compute them again. We can end up calling this twice for the same VF if we have a user-selected VF and then compute the expected cost for interleaving. This is similar to the way we do this check in collectInstsToScalarize. This will also apply to collectLoopScalars.
		delena (Author): Yes. I've changed it.
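The caching pattern requested in this thread can be illustrated outside of LLVM. The following is a minimal sketch, not the patch's code: `UniformCache`, `collect`, and the use of strings as stand-ins for instructions are all hypothetical. It only shows why keying the check on the presence of a VF entry, rather than on the set being non-empty, avoids recomputing an empty result.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for per-VF uniform sets; strings replace Instruction*.
struct UniformCache {
  std::map<unsigned, std::set<std::string>> Uniforms;
  unsigned Computations = 0; // Counts how often the analysis actually ran.

  // Runs the (pretend) analysis for VF at most once. `Found` plays the role
  // of whatever the real analysis would discover.
  void collect(unsigned VF, const std::set<std::string> &Found) {
    assert(VF != 0 && "a zero VF is not meaningful");
    // An existing entry means "already computed", even if the set is empty.
    if (Uniforms.count(VF))
      return;
    auto &UniformsVF = Uniforms[VF];
    UniformsVF = Found;
    ++Computations;
  }

  bool isUniform(unsigned VF, const std::string &Name) const {
    auto It = Uniforms.find(VF);
    return It != Uniforms.end() && It->second.count(Name) != 0;
  }
};
```

With this shape, a second call for the same VF is a no-op even when the first call found no uniforms, which is the distinction between cases (1) and (2) above.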
// Otherwise, the memory instruction should be vectorized if the rest of the		// If the memory instruction is in an interleaved group, it will be
// loop is.		// vectorized and its pointer will remain uniform.
		if (isAccessInterleaved(I) || isLegalGatherOrScatter(I))
return false;		return false;
		mssimpso: This is not true now, right? An interleaved access may be scalarized based on the cost model.
		delena (Author): It is still true. An instruction must be scalarized if there is no other option. This is a "legality" check; the cost model comes later.
		mssimpso: I think the cost vs legality distinction is not important here. My original intent with this function was just to consolidate all the scalarization conditions for a given access. That way it could be called when collecting uniforms, computing costs, and vectorizing, and they all would agree on what would happen. Perhaps the choice of name was misleading? In any case, unless I'm missing something, I don't think you actually use this function anymore. You've replaced all uses with "getWideningDecision(I, VF) == CM_DECISION_SCALARIZE". This makes sense because I think you've moved all the non-cost related scalarization decisions into the inverse: memoryInstructionCanBeWidened. Can this function be deleted now?

		return !memoryInstructionCanBeWidened(I, VF);
		}

		// Update the list of scalar and uniform values.
		void LoopVectorizationLegality::updateScalarsAndUniforms(unsigned VF) {
		Uniforms.clear();
		Scalars.clear();
		collectLoopUniforms(VF);
		collectLoopScalars(VF);
}		}

void LoopVectorizationLegality::collectLoopUniforms() {		void LoopVectorizationLegality::collectLoopUniforms(unsigned KnownVF) {
// We now know that the loop is vectorizable!		// We now know that the loop is vectorizable!
// Collect instructions inside the loop that will remain uniform after		// Collect instructions inside the loop that will remain uniform after
// vectorization.		// vectorization.

// Global values, params and instructions outside of current loop are out of		// Global values, params and instructions outside of current loop are out of
// scope.		// scope.
auto isOutOfScope = [&](Value *V) -> bool {		auto isOutOfScope = [&](Value *V) -> bool {
Instruction *I = dyn_cast<Instruction>(V);		Instruction *I = dyn_cast<Instruction>(V);
▲ Show 20 Lines • Show All 43 Lines • ▼ Show 20 Lines	for (auto &I : *BB) {
return getPointerOperand(U) == Ptr;		return getPointerOperand(U) == Ptr;
});		});

// Ensure the memory instruction will not be scalarized, making its		// Ensure the memory instruction will not be scalarized, making its
// pointer operand non-uniform. If the pointer operand is used by some		// pointer operand non-uniform. If the pointer operand is used by some
// instruction other than a memory access, we're not going to check if		// instruction other than a memory access, we're not going to check if
// that other instruction may be scalarized here. Thus, conservatively		// that other instruction may be scalarized here. Thus, conservatively
// assume the pointer operand may be non-uniform.		// assume the pointer operand may be non-uniform.
if (!UsersAreMemAccesses || memoryInstructionMustBeScalarized(&I))		if (!UsersAreMemAccesses || memoryInstructionMustBeScalarized(&I) ||
		(KnownVF >= 2 &&
		mssimpso: Why not roll this check into memoryInstructionMustBeScalarized? Either way, I think this check and the check below in isVectorizedMemAccessUse that calls memoryInstructionMustBeScalarized, should be the same.
		delena (Author): Yes, I don't need to check legality any more. It is already implied in the CM decision.
		getCostModelDecision(&I, KnownVF) == CM_DECISION_SCALARIZE))
PossibleNonUniformPtrs.insert(Ptr);		PossibleNonUniformPtrs.insert(Ptr);

// If the memory instruction will be vectorized and its pointer operand		// If the memory instruction will be vectorized and its pointer operand
// is consecutive-like, the pointer operand should remain uniform.		// is consecutive-like, the pointer operand should remain uniform.
else if (hasConsecutiveLikePtrOperand(&I))		else if (hasConsecutiveLikePtrOperand(&I))
ConsecutiveLikePtrs.insert(Ptr);		ConsecutiveLikePtrs.insert(Ptr);

// Otherwise, if the memory instruction will be vectorized and its		// Otherwise, if the memory instruction will be vectorized and its
▲ Show 20 Lines • Show All 1,220 Lines • ▼ Show 20 Lines	static const SCEV *getAddressAccessSCEV(
return SE->getSCEV(Ptr);		return SE->getSCEV(Ptr);
}		}

static bool isStrideMul(Instruction *I, LoopVectorizationLegality *Legal) {		static bool isStrideMul(Instruction *I, LoopVectorizationLegality *Legal) {
return Legal->hasStride(I->getOperand(0)) ||		return Legal->hasStride(I->getOperand(0)) ||
Legal->hasStride(I->getOperand(1));		Legal->hasStride(I->getOperand(1));
}		}

		unsigned
		LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
		unsigned VF) {
		StoreInst *SI = dyn_cast<StoreInst>(I);
		LoadInst *LI = dyn_cast<LoadInst>(I);
		Type *ValTy = (SI ? SI->getValueOperand()->getType() : LI->getType());
		auto SE = PSE.getSE();

		unsigned Alignment = SI ? SI->getAlignment() : LI->getAlignment();
		unsigned AS =
		SI ? SI->getPointerAddressSpace() : LI->getPointerAddressSpace();
		Value *Ptr = getPointerOperand(I);
		Type *PtrTy = ToVectorTy(Ptr->getType(), VF);

		// Figure out whether the access is strided and get the stride value
		// if it's known in compile time
		const SCEV *PtrSCEV = getAddressAccessSCEV(Ptr, Legal, SE, TheLoop);

		// Get the cost of the scalar memory instruction and address computation.
		unsigned Cost = VF * TTI.getAddressComputationCost(PtrTy, SE, PtrSCEV);

		Cost += VF *
		TTI.getMemoryOpCost(I->getOpcode(), ValTy->getScalarType(),
		Alignment, AS);

		// Get the overhead of the extractelement and insertelement instructions
		// we might create due to scalarization.
		Cost += getScalarizationOverhead(I, VF, TTI);

		// If we have a predicated store, it may not be executed for each vector
		// lane. Scale the cost by the probability of executing the predicated
		// block.
		if (Legal->isScalarWithPredication(I))
		Cost /= getReciprocalPredBlockProb();

		return Cost;
		}

		unsigned
		LoopVectorizationCostModel::getInterleaveGroupCost(Instruction *I, unsigned VF) {
		StoreInst *SI = dyn_cast<StoreInst>(I);
		LoadInst *LI = dyn_cast<LoadInst>(I);
		Type *ValTy = (SI ? SI->getValueOperand()->getType() : LI->getType());
		Type *VectorTy = ToVectorTy(ValTy, VF);
		unsigned AS =
		SI ? SI->getPointerAddressSpace() : LI->getPointerAddressSpace();

		auto Group = Legal->getInterleavedAccessGroup(I);
		assert(Group && "Fail to get an interleaved access group.");

		unsigned InterleaveFactor = Group->getFactor();
		Type *WideVecTy = VectorType::get(ValTy, VF * InterleaveFactor);

		// Holds the indices of existing members in an interleaved load group.
		// An interleaved store group doesn't need this as it doesn't allow gaps.
		SmallVector<unsigned, 4> Indices;
		if (LI) {
		for (unsigned i = 0; i < InterleaveFactor; i++)
		if (Group->getMember(i))
		Indices.push_back(i);
		}

		// Calculate the cost of the whole interleaved group.
		unsigned Cost = TTI.getInterleavedMemoryOpCost(
		I->getOpcode(), WideVecTy, Group->getFactor(), Indices,
		Group->getAlignment(), AS);

		if (Group->isReverse())
		Cost += Group->getNumMembers() *
		TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0);
		return Cost;
		}

LoopVectorizationCostModel::VectorizationCostTy		LoopVectorizationCostModel::VectorizationCostTy
LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF) {		LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF) {
// If we know that this instruction will remain uniform, check the cost of		// If we know that this instruction will remain uniform, check the cost of
// the scalar version.		// the scalar version.
if (Legal->isUniformAfterVectorization(I))		if (Legal->isUniformAfterVectorization(I))
		mssimpso: The VF > 1 check is not needed because you check that condition in isUniformAfterVectorization.
		delena (Author): Yes. You are right, thanks!
VF = 1;		VF = 1;
		mssimpso: Hi Elena, I had been thinking about the use of isUniformAfterVectorization() here in getInstructionCost(). Wouldn't it now be possible for the set of uniforms to differ from the first collection (before VF selection) and the second collection (after VF selection)? So we would choose a VF based on costs assuming an instruction may or may not be uniform. Then we could later reverse our initial decision about the instruction's uniformity after VF selection, making the total cost on which we based our VF decision inaccurate. Or am I missing something? I haven't yet thought through the implications of this in enough detail to know whether this would matter much or not.
		delena (Author): About the list of Uniforms. We insert and then remove only GEPs and Induction variables. We do not calculate cost for them anyway. All other Uniform values stay in place. So, the cost is accurate at the end. There is no circular dependency here.
		mssimpso: I don't think this is true in general. We mark an instruction uniform if all its users are uniform. So for example, if we have a uniform GEP whose index is some computation, that computation is also uniform if it's only used by the GEP. I think we have some examples in induction.ll, but something like this:
		    %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
		    %sum = add i64 %i, %x
		    %idx = getelementptr inbounds float, float* %a, i64 %sum
		    load float, float* %idx, align 4
		The GEP is consecutive, so it will be marked uniform. %sum will also be marked uniform because it's only used by the GEP. If we later decide to scalarize the load, the GEP, the IV, and %sum will all no longer be uniform. So the cost for %sum will have been wrong.
		mssimpso: Just a thought - why not recompute and cache the uniforms (and possibly scalars) for each VF we compute costs for? That would avoid any potential logical inconsistencies. I think the compile-time overhead would probably be minimal (and you're already computing these sets twice anyway).
		delena (Author): Just talked with Ayal about this. I can collect Uniforms after making the decision about Load/Store instructions. And the decision is based on cost. The decision affects other instructions inside the loop, as you've pointed out before. Theoretically, if I have N variants of representing all memory instructions inside the loop, I should examine 2N combinations per VF. Ayal proposed the following sequence, which should be done at the CM stage, after legality is finished. Per VF: (1) go through all memory insts and make the CM decision; (2) build Uniforms and Scalars per VF (that's what you say now); (3) calculate the cost for VF, based on Uniforms and Scalars. It is still not ideal, but probably better than what we have.
		mssimpso: Ayal's sequence makes sense to me. We should probably also try to move Uniforms/Scalars and related functions over to LoopVectorizationCostModel rather than keep them in LoopVectorizationLegality, as they will now be a function of Cost/VF and would be more appropriate there in my opinion. This could probably be a separate patch.
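The per-VF sequence discussed in this thread (decide how each memory access is widened, then derive uniformity from those decisions, then sum the cost) can be sketched with a toy model. Everything below is an assumption made for illustration: `MemAccess`, `VFPlan`, `planForVF`, and the cost numbers are hypothetical and are not the vectorizer's API or its cost model. Address computation is priced at 1 for a uniform pointer and at VF otherwise, purely to make the dependency visible.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

enum Decision { Widen, Scalarize };

// Hypothetical memory access: the pointer it uses plus toy costs.
struct MemAccess {
  std::string Ptr;
  unsigned WidenCost;         // Toy cost of the vector form.
  unsigned ScalarCostPerLane; // Toy cost of one scalar copy.
};

struct VFPlan {
  std::map<std::string, bool> PtrIsUniform;
  unsigned Cost = 0;
};

VFPlan planForVF(const std::vector<MemAccess> &Accesses, unsigned VF) {
  assert(VF != 0 && "a zero VF is not meaningful");
  VFPlan Plan;
  std::vector<Decision> Decisions;

  // Step 1: make a decision per memory access. Ties go to scalarization.
  for (const MemAccess &A : Accesses)
    Decisions.push_back(A.ScalarCostPerLane * VF <= A.WidenCost ? Scalarize
                                                                : Widen);

  // Step 2: a pointer stays uniform only if every access using it is widened.
  for (std::size_t I = 0; I < Accesses.size(); ++I) {
    auto It = Plan.PtrIsUniform.emplace(Accesses[I].Ptr, true).first;
    if (Decisions[I] == Scalarize)
      It->second = false;
  }

  // Step 3: total cost, now consistent with the uniform set from step 2.
  for (std::size_t I = 0; I < Accesses.size(); ++I)
    Plan.Cost += Decisions[I] == Scalarize
                     ? Accesses[I].ScalarCostPerLane * VF
                     : Accesses[I].WidenCost;
  for (const auto &P : Plan.PtrIsUniform)
    Plan.Cost += P.second ? 1 : VF;
  return Plan;
}
```

Because the uniform set is rebuilt inside the per-VF plan, the cost used to pick a VF cannot disagree with the uniformity assumed later, which is the inconsistency raised above.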

if (VF > 1 && isProfitableToScalarize(I, VF))		if (VF > 1 && isProfitableToScalarize(I, VF))
return VectorizationCostTy(InstsToScalarize[VF][I], false);		return VectorizationCostTy(InstsToScalarize[VF][I], false);

Type *VectorTy;		Type *VectorTy;
unsigned C = getInstructionCost(I, VF, VectorTy);		unsigned C = getInstructionCost(I, VF, VectorTy);

bool TypeNotScalarized =		bool TypeNotScalarized =
VF > 1 && !VectorTy->isVoidTy() && TTI.getNumberOfParts(VectorTy) < VF;		VF > 1 && !VectorTy->isVoidTy() && TTI.getNumberOfParts(VectorTy) < VF;
return VectorizationCostTy(C, TypeNotScalarized);		return VectorizationCostTy(C, TypeNotScalarized);
}		}

unsigned LoopVectorizationCostModel::getInstructionCost(Instruction *I,		unsigned LoopVectorizationCostModel::getInstructionCost(Instruction *I,
unsigned VF,		unsigned VF,
		mssimpso: This should be VF == 1, since we can't have a zero VF.
Type *&VectorTy) {		Type *&VectorTy) {
Type *RetTy = I->getType();		Type *RetTy = I->getType();
if (canTruncateToMinimalBitwidth(I, VF))		if (canTruncateToMinimalBitwidth(I, VF))
RetTy = IntegerType::get(RetTy->getContext(), MinBWs[I]);		RetTy = IntegerType::get(RetTy->getContext(), MinBWs[I]);
VectorTy = ToVectorTy(RetTy, VF);		VectorTy = ToVectorTy(RetTy, VF);
auto SE = PSE.getSE();		auto SE = PSE.getSE();

// TODO: We need to estimate the cost of intrinsic calls.		// TODO: We need to estimate the cost of intrinsic calls.
switch (I->getOpcode()) {		switch (I->getOpcode()) {
case Instruction::GetElementPtr:		case Instruction::GetElementPtr:
// We mark this instruction as zero-cost because the cost of GEPs in		// We mark this instruction as zero-cost because the cost of GEPs in
// vectorized code depends on whether the corresponding memory instruction		// vectorized code depends on whether the corresponding memory instruction
// is scalarized or not. Therefore, we handle GEPs with the memory		// is scalarized or not. Therefore, we handle GEPs with the memory
// instruction cost.		// instruction cost.
return 0;		return 0;
case Instruction::Br: {		case Instruction::Br: {
return TTI.getCFInstrCost(I->getOpcode());		return TTI.getCFInstrCost(I->getOpcode());
}		}
case Instruction::PHI: {		case Instruction::PHI: {
auto *Phi = cast<PHINode>(I);		auto *Phi = cast<PHINode>(I);

// First-order recurrences are replaced by vector shuffles inside the loop.		// First-order recurrences are replaced by vector shuffles inside the loop.
if (VF > 1 && Legal->isFirstOrderRecurrence(Phi))		if (VF > 1 && Legal->isFirstOrderRecurrence(Phi))
return TTI.getShuffleCost(TargetTransformInfo::SK_ExtractSubvector,		return TTI.getShuffleCost(TargetTransformInfo::SK_ExtractSubvector,
		mkuper: UINT_MAX
VectorTy, VF - 1, VectorTy);		VectorTy, VF - 1, VectorTy);

// TODO: IF-converted IFs become selects.		// TODO: IF-converted IFs become selects.
return 0;		return 0;
}		}
case Instruction::UDiv:		case Instruction::UDiv:
case Instruction::SDiv:		case Instruction::SDiv:
case Instruction::URem:		case Instruction::URem:
case Instruction::SRem:		case Instruction::SRem:
// If we have a predicated instruction, it may not be executed for each		// If we have a predicated instruction, it may not be executed for each
// vector lane. Get the scalarization cost and scale this amount by the		// vector lane. Get the scalarization cost and scale this amount by the
// probability of executing the predicated block. If the instruction is not		// probability of executing the predicated block. If the instruction is not
		mssimpso: Just assert inside getWideningCost() that "I" is present in the mapping and return the cost. The int to unsigned max conversion is unnecessary.
// predicated, we fall through to the next case.		// predicated, we fall through to the next case.
if (VF > 1 && Legal->isScalarWithPredication(I)) {		if (VF > 1 && Legal->isScalarWithPredication(I)) {
		mkuper: Why the NumAccesses * 2 cut-off?
		delena (Author): I consider a cost per instruction. In this case InterleaveCost / NumAccesses == 1. (Matthew asked to avoid divisions.) About 1 inst per access is good enough. I added more comments.
		mkuper: Well, this isn't really "about 1", this is "below 2". I'd be more conservative here (InterleaveCost <= NumAccesses? Can this even happen? Or are you trying to catch the cases where the ratio is ~1.1-1.2?). Or maybe remove this altogether. Is getGatherScatterCost() expensive in terms of compile time?
		mssimpso: The TTI estimates are supposed to be cheap to compute. I think it makes sense to remove this altogether in favor of greater simplicity.
		delena (Author): ok.
unsigned Cost = 0;		unsigned Cost = 0;

// These instructions have a non-void type, so account for the phi nodes		// These instructions have a non-void type, so account for the phi nodes
// that we will create. This cost is likely to be zero. The phi node		// that we will create. This cost is likely to be zero. The phi node
// cost, if any, should be scaled by the block probability because it		// cost, if any, should be scaled by the block probability because it
// models a copy at the end of each predicated block.		// models a copy at the end of each predicated block.
Cost += VF * TTI.getCFInstrCost(Instruction::PHI);		Cost += VF * TTI.getCFInstrCost(Instruction::PHI);

		mssimpso: Can we avoid the divisions here and below and use multiplication instead? We won't have any round-off issues that way. What about something like:
		    unsigned Accesses = 1;
		    if (Legal->isAccessInterleaved(&I)) {
		      Accesses = Group->getNumMembers();
		      ...
		    }
		    ...
		    ScalarizationCost *= Accesses;
		    ...
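The round-off concern in this comment can be shown with plain integers. This is a hedged sketch with hypothetical function names, assuming the interleave cost covers the whole group while the gather cost covers a single access; it is not code from the patch.

```cpp
#include <cassert>

// Compares on the group's scale by multiplying the per-access cost.
bool gatherIsCheaper(unsigned InterleaveGroupCost,
                     unsigned GatherCostPerAccess, unsigned NumAccesses) {
  assert(NumAccesses > 0 && "a group has at least one member");
  return GatherCostPerAccess * NumAccesses < InterleaveGroupCost;
}

// Compares on the per-access scale by dividing the group cost. Integer
// division truncates, so a real difference can be rounded away.
bool gatherIsCheaperWithDivision(unsigned InterleaveGroupCost,
                                 unsigned GatherCostPerAccess,
                                 unsigned NumAccesses) {
  assert(NumAccesses > 0 && "a group has at least one member");
  return GatherCostPerAccess < InterleaveGroupCost / NumAccesses;
}
```

For a group cost of 7 over 2 accesses and a gather cost of 3 per access, the multiplied form compares 6 against 7 and prefers the gather, while the divided form compares 3 against 7 / 2 == 3 and sees no difference.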
// The cost of the non-predicated instruction.		// The cost of the non-predicated instruction.
Cost += VF * TTI.getArithmeticInstrCost(I->getOpcode(), RetTy);		Cost += VF * TTI.getArithmeticInstrCost(I->getOpcode(), RetTy);

		mkuper: UINT_MAX
// The cost of insertelement and extractelement instructions needed for		// The cost of insertelement and extractelement instructions needed for
// scalarization.		// scalarization.
Cost += getScalarizationOverhead(I, VF, TTI);		Cost += getScalarizationOverhead(I, VF, TTI);

// Scale the cost by the probability of executing the predicated blocks.		// Scale the cost by the probability of executing the predicated blocks.
// This assumes the predicated block for each vector lane is equally		// This assumes the predicated block for each vector lane is equally
// likely.		// likely.
return Cost / getReciprocalPredBlockProb();		return Cost / getReciprocalPredBlockProb();
}		}
case Instruction::Add:		case Instruction::Add:
case Instruction::FAdd:		case Instruction::FAdd:
case Instruction::Sub:		case Instruction::Sub:
case Instruction::FSub:		case Instruction::FSub:
		mkuper: Why do you need the "GatherScatterCost < InterleaveCost" check here?
case Instruction::Mul:		case Instruction::Mul:
		mkuper: So, in case the costs are equal, you prefer scalarization to interleaving, and interleaving to scatter/gather? (In theory it shouldn't matter what happens when the costs are equal, just making sure I understand this.)
		delena (Author): Matthew asked to change: "I think we're fairly consistent in other places where we compare costs, that we prefer the scalar version if there's no benefit for vectorization. So if the scalarization cost is <= the other costs, I think we should go with that."
		mkuper: Ah, ok, missed that. Could you please add this as an explicit comment?
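The tie-breaking order confirmed in this thread (scalarization over interleaving, interleaving over gather/scatter) can be written as a small standalone chooser. This is an illustrative sketch only: `chooseWidening` is a hypothetical name, the three costs are assumed to be expressed on the same scale, and using UINT_MAX for an option that is not legal is an assumption borrowed from the UINT_MAX suggestions elsewhere in this review.

```cpp
#include <cassert>
#include <climits>

enum InstWidening {
  CM_DECISION_INTERLEAVE,
  CM_DECISION_GATHER_SCATTER,
  CM_DECISION_SCALARIZE
};

// Picks the cheapest form. On equal costs, scalarization wins over both
// vector forms, and interleaving wins over gather/scatter.
InstWidening chooseWidening(unsigned InterleaveCost,
                            unsigned GatherScatterCost,
                            unsigned ScalarizationCost) {
  // Scalarization is the fallback, so it is assumed to always have a cost.
  assert(ScalarizationCost != UINT_MAX && "scalarization must be available");
  if (ScalarizationCost <= InterleaveCost &&
      ScalarizationCost <= GatherScatterCost)
    return CM_DECISION_SCALARIZE;
  if (InterleaveCost <= GatherScatterCost)
    return CM_DECISION_INTERLEAVE;
  return CM_DECISION_GATHER_SCATTER;
}
```

Using `<=` rather than `<` in both comparisons is what encodes the preference order; with strict comparisons the result on ties would depend on the order of the checks instead.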
case Instruction::FMul:		case Instruction::FMul:
case Instruction::FDiv:		case Instruction::FDiv:
case Instruction::FRem:		case Instruction::FRem:
case Instruction::Shl:		case Instruction::Shl:
case Instruction::LShr:		case Instruction::LShr:
case Instruction::AShr:		case Instruction::AShr:
case Instruction::And:		case Instruction::And:
case Instruction::Or:		case Instruction::Or:
case Instruction::Xor: {		case Instruction::Xor: {
// Since we will replace the stride by 1 the multiplication should go away.		// Since we will replace the stride by 1 the multiplication should go away.
		mssimpso: I think we're fairly consistent in other places where we compare costs, that we prefer the scalar version if there's no benefit for vectorization. So if the scalarization cost is <= the other costs, I think we should go with that.
		delena (Author): Agree. I've changed the comparison.
if (I->getOpcode() == Instruction::Mul && isStrideMul(I, Legal))		if (I->getOpcode() == Instruction::Mul && isStrideMul(I, Legal))
return 0;		return 0;
// Certain instructions can be cheaper to vectorize if they have a constant		// Certain instructions can be cheaper to vectorize if they have a constant
// second vector operand. One example of this are shifts on x86.		// second vector operand. One example of this are shifts on x86.
TargetTransformInfo::OperandValueKind Op1VK =		TargetTransformInfo::OperandValueKind Op1VK =
TargetTransformInfo::OK_AnyValue;		TargetTransformInfo::OK_AnyValue;
TargetTransformInfo::OperandValueKind Op2VK =		TargetTransformInfo::OperandValueKind Op2VK =
TargetTransformInfo::OK_AnyValue;		TargetTransformInfo::OK_AnyValue;
▲ Show 20 Lines • Show All 59 Lines • ▼ Show 20 Lines	case Instruction::Load: {
// instruction because only here we know whether the operation is		// instruction because only here we know whether the operation is
// scalarized.		// scalarized.
if (VF == 1)		if (VF == 1)
return TTI.getAddressComputationCost(VectorTy) +		return TTI.getAddressComputationCost(VectorTy) +
TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS);		TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS);

if (LI && Legal->isUniform(Ptr)) {		if (LI && Legal->isUniform(Ptr)) {
// Scalar load + broadcast		// Scalar load + broadcast
		Legal->setCostModelDecision(I, VF,
		LoopVectorizationLegality::CM_DECISION_SCALARIZE);
unsigned Cost = TTI.getAddressComputationCost(ValTy->getScalarType());		unsigned Cost = TTI.getAddressComputationCost(ValTy->getScalarType());
Cost += TTI.getMemoryOpCost(I->getOpcode(), ValTy->getScalarType(),		Cost += TTI.getMemoryOpCost(I->getOpcode(), ValTy->getScalarType(),
Alignment, AS);		Alignment, AS);
return Cost +		return Cost +
TTI.getShuffleCost(TargetTransformInfo::SK_Broadcast, ValTy);		TTI.getShuffleCost(TargetTransformInfo::SK_Broadcast, VectorTy);
}

// For an interleaved access, calculate the total cost of the whole
// interleave group.
if (Legal->isAccessInterleaved(I)) {
auto Group = Legal->getInterleavedAccessGroup(I);
assert(Group && "Fail to get an interleaved access group.");

// Only calculate the cost once at the insert position.
if (Group->getInsertPos() != I)
return 0;

unsigned InterleaveFactor = Group->getFactor();
Type *WideVecTy =
VectorType::get(VectorTy->getVectorElementType(),
VectorTy->getVectorNumElements() * InterleaveFactor);

// Holds the indices of existing members in an interleaved load group.
// An interleaved store group doesn't need this as it doesn't allow gaps.
SmallVector<unsigned, 4> Indices;
if (LI) {
for (unsigned i = 0; i < InterleaveFactor; i++)
if (Group->getMember(i))
Indices.push_back(i);
}		}

// Calculate the cost of the whole interleaved group.		// We assume that widening is the best solution when possible.
unsigned Cost = TTI.getInterleavedMemoryOpCost(		if (Legal->memoryInstructionCanBeWidened(I, VF)) {
I->getOpcode(), WideVecTy, Group->getFactor(), Indices,		int ConsecutiveStride = Legal->isConsecutivePtr(Ptr);
Group->getAlignment(), AS);		assert(ConsecutiveStride != 0 &&
		"Can't widen non-consecutive memory instruction");
		bool Reverse = ConsecutiveStride < 0;

if (Group->isReverse())		unsigned Cost = TTI.getAddressComputationCost(VectorTy);
		if (Legal->isMaskRequired(I))
Cost +=		Cost +=
Group->getNumMembers() *		TTI.getMaskedMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS);
TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0);		else
		Cost += TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS);
mssimpso: Can't this all be replaced by a call to getInterleaveGroupCost()?
delena (author): I reached the conclusion that we don't need to calculate the cost again. We can keep it together with widening decision.

// FIXME: The interleaved load group with a huge gap could be even more		if (Reverse)
// expensive than scalar operations. Then we could ignore such group and		Cost += TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy,
// use scalar operations instead.		0);
		Legal->setCostModelDecision(I, VF,
		LoopVectorizationLegality::CM_DECISION_WIDEN);
return Cost;		return Cost;
}		}

// Check if the memory instruction will be scalarized.		// For an interleaved access, calculate the total cost of the whole
mssimpso: Can't this all be replaced by a call to getMemInstScalarizationCost()?
if (Legal->memoryInstructionMustBeScalarized(I, VF)) {		// interleave group.
unsigned Cost = 0;		unsigned InterleaveGrpCost = 0;
Type *PtrTy = ToVectorTy(Ptr->getType(), VF);		unsigned InterleaveInstCost = (unsigned)(-1);
		unsigned GatherScatterCost = (unsigned)(-1);
// Figure out whether the access is strided and get the stride value		if (Legal->isAccessInterleaved(I)) {
// if it's known in compile time		auto Group = Legal->getInterleavedAccessGroup(I);
const SCEV *PtrSCEV = getAddressAccessSCEV(Ptr, Legal, SE, TheLoop);		assert(Group && "Fail to get an interleaved access group.");

// Get the cost of the scalar memory instruction and address computation.
Cost += VF * TTI.getAddressComputationCost(PtrTy, SE, PtrSCEV);
Cost += VF *
TTI.getMemoryOpCost(I->getOpcode(), ValTy->getScalarType(),
Alignment, AS);

// Get the overhead of the extractelement and insertelement instructions
// we might create due to scalarization.
Cost += getScalarizationOverhead(I, VF, TTI);

// If we have a predicated store, it may not be executed for each vector
// lane. Scale the cost by the probability of executing the predicated
// block.
if (Legal->isScalarWithPredication(I))
Cost /= getReciprocalPredBlockProb();

return Cost;		LoopVectorizationLegality::InstWidening Decision =
		Legal->getCostModelDecision(I, VF);
		if (Decision == LoopVectorizationLegality::CM_DECISION_NONE) {
		// Only calculate the cost once for the group.
		InterleaveGrpCost = getInterleaveGroupCost(I, VF);
mkuper: Wouldn't you calculate it several times per group if it's unprofitable?
delena (author): LV_NONE means "not calculated yet". If we are here, each memory instruction will have "a decision".

		if (Group->getNumMembers() > 1)
		DEBUG(dbgs() << "LV: Found an estimated cost of " <<
		InterleaveGrpCost << " for VF " <<
		VF << " For interleaving group: " << *I << '\n');

		if (InterleaveGrpCost / Group->getNumMembers() <= 1) {
		// The interleaving cost is good enough.
		Legal->setCostModelDecision(Group, VF,
		LoopVectorizationLegality::CM_DECISION_INTERLEAVE);
		return InterleaveGrpCost;
		}
		// We are going to check all other options.
		InterleaveInstCost = InterleaveGrpCost / Group->getNumMembers();
		} else if (Decision == LoopVectorizationLegality::CM_DECISION_INTERLEAVE)
		// This instructions belongs to interleaving group and will be
		// vectorized inside this group.
		return 0;
		// If we are here it means that the instruction belongs to an
		// interleaving group, but interleaving is not profitable for the current
		// VF. We'll need to choose another option.
}		}

// Determine if the pointer operand of the access is either consecutive or		if (Legal->isLegalGatherOrScatter(I)) {
// reverse consecutive.		GatherScatterCost = TTI.getAddressComputationCost(VectorTy) +
int ConsecutiveStride = Legal->isConsecutivePtr(Ptr);
bool Reverse = ConsecutiveStride < 0;

// Determine if either a gather or scatter operation is legal.
bool UseGatherOrScatter =
!ConsecutiveStride && Legal->isLegalGatherOrScatter(I);

unsigned Cost = TTI.getAddressComputationCost(VectorTy);
if (UseGatherOrScatter) {
assert(ConsecutiveStride == 0 &&
"Gather/Scatter are not used for consecutive stride");
return Cost +
TTI.getGatherScatterOpCost(I->getOpcode(), VectorTy, Ptr,		TTI.getGatherScatterOpCost(I->getOpcode(), VectorTy, Ptr,
Legal->isMaskRequired(I), Alignment);		Legal->isMaskRequired(I), Alignment);
}		}
// Wide load/stores.		unsigned ScalarizationCost = getMemInstScalarizationCost(I, VF);
if (Legal->isMaskRequired(I))
Cost +=
TTI.getMaskedMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS);
else
Cost += TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS);

if (Reverse)		// Choose better solution for the current VF and return the cost.
Cost += TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0);		// Write down this decision and use it during vectorization.
		unsigned Cost;
		LoopVectorizationLegality::InstWidening Decision;
		if (InterleaveInstCost <= GatherScatterCost &&
		InterleaveInstCost <= ScalarizationCost) {
		Decision = LoopVectorizationLegality::CM_DECISION_INTERLEAVE;
		Cost = InterleaveGrpCost;
		} else if (GatherScatterCost < InterleaveInstCost &&
		GatherScatterCost < ScalarizationCost) {
		Decision = LoopVectorizationLegality::CM_DECISION_GATHER_SCATTER;
		Cost = GatherScatterCost;
		} else {
		Cost = ScalarizationCost;
		Decision = LoopVectorizationLegality::CM_DECISION_SCALARIZE;
mssimpso: Similar comment to the above. I don't think you have a helper for computing gather/scatter cost, but I think it would be nice. It would be easier to keep getInstructionCost in sync with setCostBasedWideningDecision.
		}
		if (auto Group = Legal->getInterleavedAccessGroup(I))
		Legal->setCostModelDecision(Group, VF, Decision);
		else
		Legal->setCostModelDecision(I, VF, Decision);
return Cost;		return Cost;
}		}
case Instruction::ZExt:		case Instruction::ZExt:
case Instruction::SExt:		case Instruction::SExt:
case Instruction::FPToUI:		case Instruction::FPToUI:
case Instruction::FPToSI:		case Instruction::FPToSI:
case Instruction::FPExt:		case Instruction::FPExt:
case Instruction::PtrToInt:		case Instruction::PtrToInt:
[90 lines not shown; context: void LoopVectorizationCostModel::collectValuesToIgnore() {]
// Ignore type-promoting instructions we identified during reduction		// Ignore type-promoting instructions we identified during reduction
// detection.		// detection.
for (auto &Reduction : *Legal->getReductionVars()) {		for (auto &Reduction : *Legal->getReductionVars()) {
RecurrenceDescriptor &RedDes = Reduction.second;		RecurrenceDescriptor &RedDes = Reduction.second;
SmallPtrSetImpl<Instruction *> &Casts = RedDes.getCastInsts();		SmallPtrSetImpl<Instruction *> &Casts = RedDes.getCastInsts();
VecValuesToIgnore.insert(Casts.begin(), Casts.end());		VecValuesToIgnore.insert(Casts.begin(), Casts.end());
}		}

// Insert values known to be scalar into VecValuesToIgnore. This is a		// Insert values known to be scalar into VecValuesToIgnore. This is a
// conservative estimation of the values that will later be scalarized.		// conservative estimation of the values that will later be scalarized.
//		//
// FIXME: Even though an instruction is not scalar-after-vectoriztion, it may		// FIXME: Even though an instruction is not scalar-after-vectoriztion, it may
// still be scalarized. For example, we may find an instruction to be		// still be scalarized. For example, we may find an instruction to be
// more profitable for a given vectorization factor if it were to be		// more profitable for a given vectorization factor if it were to be
// scalarized. But at this point, we haven't yet computed the		// scalarized. But at this point, we haven't yet computed the
// vectorization factor.		// vectorization factor.
for (auto *BB : TheLoop->getBlocks())		for (auto *BB : TheLoop->getBlocks())
for (auto &I : *BB)		for (auto &I : *BB)
if (Legal->isScalarAfterVectorization(&I))		if (Legal->isScalarAfterVectorization(&I))
VecValuesToIgnore.insert(&I);		VecValuesToIgnore.insert(&I);
}		}
mssimpso: Can we just delete this in favor of a helper function that checks VecValuesToIgnore and IsScalarAfterVectorization for a given VF? Something like: bool LoopVectorizationCostModel::shouldIgnoreVecValueInCostModel(Instruction *I, unsigned VF) { return VecValuesToIgnore.count(I) || isScalarAfterVectorization(I, VF); } This way we won't have to be imprecise.
delena (author): Yes, it is possible.

void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,		void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,
bool IfPredicateInstr) {		bool IfPredicateInstr) {
assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");		assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");
// Holds vector parameters or scalars, in case of uniform vals.		// Holds vector parameters or scalars, in case of uniform vals.
SmallVector<VectorParts, 4> Params;		SmallVector<VectorParts, 4> Params;

setDebugLocFromInst(Builder, Instr);		setDebugLocFromInst(Builder, Instr);
[444 lines not shown]
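The exchange above about LV_NONE meaning "not calculated yet" describes a per-instruction, per-VF decision that is computed once and then reused, with a whole interleave group sharing one decision. Below is a minimal sketch of that bookkeeping, not the patch's code: `DecisionCache`, the integer instruction ids, and the `Decision` enumerators are hypothetical stand-ins for the patch's `setCostModelDecision` / `getCostModelDecision` and `CM_DECISION_*` values.

```cpp
#include <initializer_list>
#include <map>
#include <utility>

// Hypothetical decision kinds; None plays the role of "not calculated yet".
enum class Decision { None, Widen, Interleave, GatherScatter, Scalarize };

// Hypothetical cache keyed by (instruction id, vectorization factor).
class DecisionCache {
  std::map<std::pair<int, unsigned>, Decision> Decisions;

public:
  // Returns Decision::None when nothing has been recorded for this pair.
  Decision get(int InstId, unsigned VF) const {
    auto It = Decisions.find({InstId, VF});
    return It == Decisions.end() ? Decision::None : It->second;
  }

  // Records a decision for a single instruction at the given VF.
  void set(int InstId, unsigned VF, Decision D) { Decisions[{InstId, VF}] = D; }

  // Records the same decision for every member of a group, so that later
  // queries on any member see a decision and skip the group cost computation.
  void setForGroup(std::initializer_list<int> MemberIds, unsigned VF,
                   Decision D) {
    for (int Id : MemberIds)
      set(Id, VF, D);
  }
};
```

Under these assumptions, the group cost is computed only by the first member that still sees `Decision::None`. Every other member finds a recorded decision, whether or not interleaving turned out to be profitable.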

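The inline discussion in LoopVectorize.cpp turns on how to pick among interleaving, gather/scatter, and scalarization once all three costs are known. The following is a minimal sketch of the rule mssimpso suggests (scalarization wins when its cost is <= the others), not the code in this diff. `WideningChoice`, `Choice`, `chooseWidening`, and `Unavailable` are hypothetical names. `Unavailable` mirrors the diff's use of `(unsigned)(-1)` for options that are not legal, and the per-instruction interleave cost is the group cost divided by the member count, as in the diff.

```cpp
#include <limits>

// Hypothetical stand-ins for the patch's CM_DECISION_* values.
enum class WideningChoice { Interleave, GatherScatter, Scalarize };

struct Choice {
  WideningChoice Kind;
  unsigned Cost; // Cost to report: group cost for Interleave, else per-instruction.
};

// Marks an option that is not available, like (unsigned)(-1) in the diff.
constexpr unsigned Unavailable = std::numeric_limits<unsigned>::max();

// Picks the cheapest option; scalarization wins every tie (assumed policy,
// following the reviewer's suggestion rather than the diff as shown).
Choice chooseWidening(unsigned InterleaveGroupCost, unsigned NumMembers,
                      unsigned GatherScatterCost, unsigned ScalarizationCost) {
  unsigned InterleavePerInst = Unavailable;
  if (InterleaveGroupCost != Unavailable && NumMembers != 0)
    InterleavePerInst = InterleaveGroupCost / NumMembers;

  if (ScalarizationCost <= InterleavePerInst &&
      ScalarizationCost <= GatherScatterCost)
    return {WideningChoice::Scalarize, ScalarizationCost};
  if (InterleavePerInst <= GatherScatterCost)
    return {WideningChoice::Interleave, InterleaveGroupCost};
  return {WideningChoice::GatherScatter, GatherScatterCost};
}
```

With made-up costs, `chooseWidening(35, 1, 10, 40)` selects gather/scatter, which is the situation the summary describes: a single-member group with a high interleave cost where a gather is cheaper.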
../test/Analysis/CostModel/X86/interleave-load-i32.ll

; REQUIRES: asserts		; REQUIRES: asserts
; RUN: opt -loop-vectorize -S -mcpu=skx --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s		; RUN: opt -loop-vectorize -S -mcpu=skx --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s

target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"		target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"		target triple = "x86_64-unknown-linux-gnu"

@A = global [10240 x i32] zeroinitializer, align 16		@A = global [10240 x i32] zeroinitializer, align 16
@B = global [10240 x i32] zeroinitializer, align 16		@B = global [10240 x i32] zeroinitializer, align 16

; Function Attrs: nounwind uwtable		; Function Attrs: nounwind uwtable
define void @load_i32_interleave4() {		define void @load_i32_interleave4() {
;CHECK-LABEL: load_i32_interleave4		;CHECK-LABEL: load_i32_interleave4
;CHECK: Found an estimated cost of 1 for VF 1 For instruction: %0 = load		;CHECK: Found an estimated cost of 5 for VF 2 For interleaving group: {{.*}} = load
mkuper: What happened to the VF = 1 cost?
delena (author): There is no group for VF 1. Instruction cost is still there.
;CHECK: Found an estimated cost of 5 for VF 2 For instruction: %0 = load		;CHECK: Found an estimated cost of 5 for VF 4 For interleaving group: {{.*}} = load
;CHECK: Found an estimated cost of 5 for VF 4 For instruction: %0 = load		;CHECK: Found an estimated cost of 8 for VF 8 For interleaving group: {{.*}} = load
;CHECK: Found an estimated cost of 8 for VF 8 For instruction: %0 = load		;CHECK: Found an estimated cost of 22 for VF 16 For interleaving group: {{.*}} = load
;CHECK: Found an estimated cost of 22 for VF 16 For instruction: %0 = load
entry:		entry:
br label %for.body		br label %for.body

for.cond.cleanup: ; preds = %for.body		for.cond.cleanup: ; preds = %for.body
ret void		ret void

for.body: ; preds = %entry, %for.body		for.body: ; preds = %entry, %for.body
%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]		%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
[15 lines not shown; context: for.body:]
store i32 %add11, i32* %arrayidx13, align 16		store i32 %add11, i32* %arrayidx13, align 16
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 4		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 4
%cmp = icmp slt i64 %indvars.iv.next, 1024		%cmp = icmp slt i64 %indvars.iv.next, 1024
br i1 %cmp, label %for.body, label %for.cond.cleanup		br i1 %cmp, label %for.body, label %for.cond.cleanup
}		}

define void @load_i32_interleave5() {		define void @load_i32_interleave5() {
;CHECK-LABEL: load_i32_interleave5		;CHECK-LABEL: load_i32_interleave5
;CHECK: Found an estimated cost of 1 for VF 1 For instruction: %0 = load		;CHECK: Found an estimated cost of 6 for VF 2 For interleaving group: {{.*}} = load
;CHECK: Found an estimated cost of 6 for VF 2 For instruction: %0 = load		;CHECK: Found an estimated cost of 9 for VF 4 For interleaving group: {{.*}} = load
;CHECK: Found an estimated cost of 9 for VF 4 For instruction: %0 = load		;CHECK: Found an estimated cost of 18 for VF 8 For interleaving group: {{.*}} = load
;CHECK: Found an estimated cost of 18 for VF 8 For instruction: %0 = load		;CHECK: Found an estimated cost of 35 for VF 16 For interleaving group: {{.*}} = load
;CHECK: Found an estimated cost of 35 for VF 16 For instruction: %0 = load
entry:		entry:
br label %for.body		br label %for.body

for.cond.cleanup: ; preds = %for.body		for.cond.cleanup: ; preds = %for.body
ret void		ret void

for.body: ; preds = %entry, %for.body		for.body: ; preds = %entry, %for.body
%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]		%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
[24 lines not shown]

../test/Analysis/CostModel/X86/interleave-store-i32.ll

; REQUIRES: asserts		; REQUIRES: asserts
; RUN: opt -loop-vectorize -S -mcpu=skx --debug-only=loop-vectorize < %s 2>&1| FileCheck %s		; RUN: opt -loop-vectorize -S -mcpu=skx --debug-only=loop-vectorize < %s 2>&1| FileCheck %s

target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"		target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"		target triple = "x86_64-unknown-linux-gnu"

@A = global [10240 x i32] zeroinitializer, align 16		@A = global [10240 x i32] zeroinitializer, align 16
@B = global [10240 x i32] zeroinitializer, align 16		@B = global [10240 x i32] zeroinitializer, align 16

; Function Attrs: nounwind uwtable		; Function Attrs: nounwind uwtable
define void @store_i32_interleave4() {		define void @store_i32_interleave4() {
;CHECK-LABEL: store_i32_interleave4		;CHECK-LABEL: store_i32_interleave4
;CHECK: Found an estimated cost of 1 for VF 1 For instruction: store i32 %add16		;CHECK: Found an estimated cost of 5 for VF 2 For interleaving group: store
;CHECK: Found an estimated cost of 5 for VF 2 For instruction: store i32 %add16		;CHECK: Found an estimated cost of 5 for VF 4 For interleaving group: store
;CHECK: Found an estimated cost of 5 for VF 4 For instruction: store i32 %add16		;CHECK: Found an estimated cost of 11 for VF 8 For interleaving group: store
;CHECK: Found an estimated cost of 11 for VF 8 For instruction: store i32 %add16		;CHECK: Found an estimated cost of 22 for VF 16 For interleaving group: store
;CHECK: Found an estimated cost of 22 for VF 16 For instruction: store i32 %add16
entry:		entry:
br label %for.body		br label %for.body

for.cond.cleanup: ; preds = %for.body		for.cond.cleanup: ; preds = %for.body
ret void		ret void

for.body: ; preds = %entry, %for.body		for.body: ; preds = %entry, %for.body
%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]		%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
[15 lines not shown; context: for.body:]
store i32 %add16, i32* %arrayidx19, align 4		store i32 %add16, i32* %arrayidx19, align 4
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 4		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 4
%cmp = icmp slt i64 %indvars.iv.next, 1024		%cmp = icmp slt i64 %indvars.iv.next, 1024
br i1 %cmp, label %for.body, label %for.cond.cleanup		br i1 %cmp, label %for.body, label %for.cond.cleanup
}		}

define void @store_i32_interleave5() {		define void @store_i32_interleave5() {
;CHECK-LABEL: store_i32_interleave5		;CHECK-LABEL: store_i32_interleave5
;CHECK: Found an estimated cost of 1 for VF 1 For instruction: store i32 %add22		;CHECK: Found an estimated cost of 7 for VF 2 For interleaving group: store
;CHECK: Found an estimated cost of 7 for VF 2 For instruction: store i32 %add22		;CHECK: Found an estimated cost of 14 for VF 4 For interleaving group: store
;CHECK: Found an estimated cost of 14 for VF 4 For instruction: store i32 %add22		;CHECK: Found an estimated cost of 21 for VF 8 For interleaving group: store
;CHECK: Found an estimated cost of 21 for VF 8 For instruction: store i32 %add22		;CHECK: Found an estimated cost of 35 for VF 16 For interleaving group: store
;CHECK: Found an estimated cost of 35 for VF 16 For instruction: store i32 %add22
entry:		entry:
br label %for.body		br label %for.body

for.cond.cleanup: ; preds = %for.body		for.cond.cleanup: ; preds = %for.body
ret void		ret void

for.body: ; preds = %entry, %for.body		for.body: ; preds = %entry, %for.body
%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]		%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
[24 lines not shown]

../test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll

				; REQUIRES: asserts
				; RUN: opt < %s -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -S --debug-only=loop-vectorize 2>&1 | FileCheck %s

				; This test shows extremely high interleaving cost that, probably, should be fixed.
				; Due to the high cost, interleaving is not beneficial and the cost model chooses to scalarize
				; the load instructions.

				target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64--linux-gnu"

				%pair = type { i8, i8 }

				; CHECK-LABEL: test
				; CHECK: Found an estimated cost of 79 for VF 2 For interleaving group: {{.*}} load i8
				; CHECK: Found an estimated cost of 10 for VF 2 For instruction: {{.*}} load i8
				; CHECK: vector.body
				; CHECK: load i8
				; CHECK: load i8
				; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body

				define void @test(%pair* %p, i64 %n) {
				entry:
				br label %for.body

				for.body:
				%i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
				%tmp0 = getelementptr %pair, %pair* %p, i64 %i, i32 0
				%tmp1 = load i8, i8* %tmp0, align 1
				%tmp2 = getelementptr %pair, %pair* %p, i64 %i, i32 1
				%tmp3 = load i8, i8* %tmp2, align 1
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp eq i64 %i.next, %n
				br i1 %cond, label %for.end, label %for.body

				for.end:
				ret void
				}

../test/Transforms/LoopVectorize/X86/consecutive-ptr-uniforms.ll

[12 lines not shown]
	; scatter operation. %tmp3 (and the induction variable) should not be marked			; scatter operation. %tmp3 (and the induction variable) should not be marked
	; uniform-after-vectorization.			; uniform-after-vectorization.
	;			;
	; CHECK: LV: Found uniform instruction: %tmp0 = getelementptr inbounds %data, %data* %d, i64 0, i32 3, i64 %i			; CHECK: LV: Found uniform instruction: %tmp0 = getelementptr inbounds %data, %data* %d, i64 0, i32 3, i64 %i
	; CHECK-NOT: LV: Found uniform instruction: %tmp3 = getelementptr inbounds %data, %data* %d, i64 0, i32 0, i64 %i			; CHECK-NOT: LV: Found uniform instruction: %tmp3 = getelementptr inbounds %data, %data* %d, i64 0, i32 0, i64 %i
	; CHECK-NOT: LV: Found uniform instruction: %i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]			; CHECK-NOT: LV: Found uniform instruction: %i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
	; CHECK-NOT: LV: Found uniform instruction: %i.next = add nuw nsw i64 %i, 5			; CHECK-NOT: LV: Found uniform instruction: %i.next = add nuw nsw i64 %i, 5
	; CHECK: vector.body:			; CHECK: vector.body:
				; CHECK: %index = phi i64
	; CHECK: %vec.ind = phi <16 x i64>			; CHECK: %vec.ind = phi <16 x i64>
	; CHECK: %[[T0:.+]] = extractelement <16 x i64> %vec.ind, i32 0			; CHECK: %[[T0:.+]] = mul i64 %index, 5
	; CHECK: %[[T1:.+]] = getelementptr inbounds %data, %data* %d, i64 0, i32 3, i64 %[[T0]]			; CHECK: %[[T1:.+]] = getelementptr inbounds %data, %data* %d, i64 0, i32 3, i64 %[[T0]]
	; CHECK: %[[T2:.+]] = bitcast float* %[[T1]] to <80 x float>*			; CHECK: %[[T2:.+]] = bitcast float* %[[T1]] to <80 x float>*
	; CHECK: load <80 x float>, <80 x float>* %[[T2]], align 4			; CHECK: load <80 x float>, <80 x float>* %[[T2]], align 4
	; CHECK: %[[T3:.+]] = getelementptr inbounds %data, %data* %d, i64 0, i32 0, i64 %[[T0]]			; CHECK: %[[T3:.+]] = getelementptr inbounds %data, %data* %d, i64 0, i32 0, i64 %[[T0]]
	; CHECK: %[[T4:.+]] = bitcast float* %[[T3]] to <80 x float>*			; CHECK: %[[T4:.+]] = bitcast float* %[[T3]] to <80 x float>*
	; CHECK: load <80 x float>, <80 x float>* %[[T4]], align 4			; CHECK: load <80 x float>, <80 x float>* %[[T4]], align 4
	; CHECK: %VectorGep = getelementptr inbounds %data, %data* %d, i64 0, i32 0, <16 x i64> %vec.ind			; CHECK: %VectorGep = getelementptr inbounds %data, %data* %d, i64 0, i32 0, <16 x i64> %vec.ind
	; CHECK: call void @llvm.masked.scatter.v16f32({{.*}}, <16 x float*> %VectorGep, {{.*}})			; CHECK: call void @llvm.masked.scatter.v16f32({{.*}}, <16 x float*> %VectorGep, {{.*}})
[26 lines not shown]

../test/Transforms/LoopVectorize/X86/gather-vs-interleave.ll

				; RUN: opt -loop-vectorize -S -mcpu=skylake-avx512 < %s | FileCheck %s

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				; This test checks that "gather" operation is choosen since it's cost is better
				; than interleaving pattern.
				;
				;unsigned long A[SIZE];
				;unsigned long B[SIZE];
				;
				;void foo() {
				; for (int i=0; i<N; i+=8) {
				; B[i] = A[i] + 5;
				; }
				;}

				@A = global [10240 x i64] zeroinitializer, align 16
				@B = global [10240 x i64] zeroinitializer, align 16


				; CHECK_LABEL: strided_load_i64
				; CHECK: masked.gather
				define void @strided_load_i64() {
				br label %1

				; <label>:1: ; preds = %0, %1
				%indvars.iv = phi i64 [ 0, %0 ], [ %indvars.iv.next, %1 ]
				%2 = getelementptr inbounds [10240 x i64], [10240 x i64]* @A, i64 0, i64 %indvars.iv
				%3 = load i64, i64* %2, align 16
				%4 = add i64 %3, 5
				%5 = getelementptr inbounds [10240 x i64], [10240 x i64]* @B, i64 0, i64 %indvars.iv
				store i64 %4, i64* %5, align 16
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 8
				%6 = icmp slt i64 %indvars.iv.next, 1024
				br i1 %6, label %1, label %7

				; <label>:7: ; preds = %1
				ret void
				}

Diff 84811