This is an archive of the discontinued LLVM Phabricator instance.

[Loop Vectorizer] Interleave vs Gather - in some cases Gather is better.
ClosedPublic

Authored by delena on Dec 19 2016, 7:10 AM.

Details

Summary

The bug is described in PR31426.
The cost of Load instruction is calculated in the following order isConsecultive - isInterleave - isGather - scalar.
When a Load instruction belongs to Interleave group, the "Gather" option is not checked at all. But when the interleave factor exceeds the maximum, the cost is high and the "Gather" is preferred in this case. The following loop is not vectorized on AVX-512 due to this bug:
for (i=0; i<N; ++i)

B[i] = A[i*5]

Diff Detail

Repository
rL LLVM

Event Timeline

delena updated this revision to Diff 81947.Dec 19 2016, 7:10 AM
delena retitled this revision from to [Loop Vectorizer] Interleave vs Gather - in some cases Gather is better..
delena updated this object.
delena added reviewers: mkuper, Ayal, anemet.
delena set the repository for this revision to rL LLVM.
delena added a subscriber: llvm-commits.
delena updated this revision to Diff 81949.Dec 19 2016, 7:21 AM

Added some comments.

mssimpso added inline comments.Dec 19 2016, 9:23 AM
../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7047–7049 ↗(On Diff #81949)

Why don't you just compare the costs? You wouldn't need to make this assumption anymore.

mkuper added inline comments.Dec 19 2016, 10:11 AM
../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7052 ↗(On Diff #81949)

Aren't we already checking this in selectInterleaveCount()?
How do we end up with interleave factors above MaxInterleaveFactor in the first place?

mssimpso added inline comments.Dec 19 2016, 10:25 AM
../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7052 ↗(On Diff #81949)

Ah, our naming conventions have confused this somewhat, I think. TTI.getMaxInterleaveFacor is the hook for the max unroll factor ("interleaving"). I think Elana was wanting TLI.getMaxSupportedInterleaveFactor instead. This is the hook for determining the max factor of the interleaved access groups.

mkuper added inline comments.Dec 19 2016, 10:31 AM
../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7052 ↗(On Diff #81949)

Argh, right.
Sorry for the stupid question, I just looked at this and went "this looks odd" without actually reading the context. But yes, Elena, you want the other function.

Regardless, can we maybe change the naming to something sane? As a separate patch, of course. (I don't have any good ideas, though.)

delena added inline comments.Dec 20 2016, 6:04 AM
../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7047–7049 ↗(On Diff #81949)

The cost that we provide for interleaved access is incorrect, specially for AVX-512. AVX-512 has 3-src shuffles and the real cost is much lower . I can't compare it to Gather - the Gather cost wins today, even for small stride, but it is not true. So there are 2 bugs:

(1) The loop is scalarized and Gather/Scatter option is not considered at all
(2) Incorrect cost for interleaving

I can start from providing a correct cost for interleaving on AVX-512.
Or I can fix the (1) first of all. I'll retrieve the proper "MaxSupportedInterleaveFactor".

What do you think?

mkuper added inline comments.Dec 20 2016, 11:13 AM
../../ver4/lib/Transforms/Vectorize/LoopVectorize.cpp
7047–7049 ↗(On Diff #81949)

I think it would be better to fix the cost model first.

It's very pessimistic for x86 in general, not AVX-512, but you're right, it's even worse for AVX-512, because the real cost is lower. But I thought Farhana was already working on that. Am I confused?

delena updated this revision to Diff 83227.Jan 5 2017, 6:06 AM

Now, when the interleaving cost calculation is correct, I compare the cost of 3 possible options - interleave, gather and scalar and choose the better option.
The decision made by cost model should be saved and latter used when we generate the vector code.

The code is refactored in order to allow this comparison, but the comparison is the *only* functional change in the code. The rest of the logic remains the same.

I added a test case that demonstrates gather - vs - interleave case.

delena updated this revision to Diff 83607.Jan 9 2017, 4:05 AM

Merged the patch with the latest changes in LV.

Hi Elena,

This patch causes a crash in spec2006/povray on AArch64. I've pasted a test case over at P7951. The problem has to do with the analysis in collectLoopUniforms and the new decision to scalarize. collectLoopUniforms is very conservative about what instructions remain uniform after vectorization. If a memory access has the possibility of being scalarized (even though it may not be), it's pointer operand is not marked uniform. What's happening here is that you've introduced a new scalarization decision based on the cost model that collectLoopUniforms doesn't know about (and likely can't know about). In the test case, collectLoopUniforms marks the pointer operands of the loads uniform even though the loads are now scalarized, which causes the pointer operands to be non-uniform.

I'm not really sure what the best way to fix this would be. Any thoughts? One way would be to remove scalarization from your widening decision.

mssimpso removed a subscriber: mssimpso.
mkuper edited edge metadata.Jan 16 2017, 11:15 PM

Hi Elena,

I'm not really sure what the best way to fix this would be. Any thoughts? One way would be to remove scalarization from your widening decision.

That sounds a bit unfortunate. I think we'd rather like to the cost model to be able to say "please scalarize this".

../lib/Transforms/Vectorize/LoopVectorize.cpp
1702 ↗(On Diff #83607)

I'm not a fan of the name of this enum, but don't really have any good ideas. Anyone else?

1725 ↗(On Diff #83607)

This looks like a suboptimal way to iterate over InterleaveGroup, because it requires a lookup for every index, instead of just iterating above the Members collection. (we don't care about the order here, right?)
Perhaps add a better interface?

1731 ↗(On Diff #83607)

Is this really a per-instruction thing, or do we just always end up with LV_NONE when the cost model is not in use (e.g. explicit user-provided VF).

If the former, when does this happen?
If the latter, then I think the comment is a bit misleading - and it would be good to be able to assert on this happening only when there's no cost model.

Edit: Oh, I think I see, you care about temporary LV_NONE values for Interleave groups. I still think it'd be nice if we could somehow distinguish between the cases, so we could assert on not having a cost when we should.

2878 ↗(On Diff #83607)

Extra space after =.

2891 ↗(On Diff #83607)

assert(!Legal->memoryInstructionMustBeScalarized(Instr, VF)), maybe?

Because now we leave the door open for the cost model decision being to widen, even though memoryInstructionMustBeScalarized().
Another option is to replace the if with:

if (Decision == LoopVectorizationLegality::LV_SCALARIZE ||     
    Legal->memoryInstructionMustBeScalarized(Instr, VF))

But I'm not sure it makes sense, because there's no good reason to always call memoryInstructionMustBeScalarized() in a non-asserts build - in theory, the cost model should have already checked this.

7168 ↗(On Diff #83607)

Wouldn't you calculate it several times per group if it's unprofitable?

../test/Analysis/CostModel/X86/interleave-load-i32.ll
13 ↗(On Diff #83607)

What happened to the VF = 1 cost?

delena marked an inline comment as done.Jan 17 2017, 1:59 AM

Hi Elena,

This patch causes a crash in spec2006/povray on AArch64. I've pasted a test case over at P7951. The problem has to do with the analysis in collectLoopUniforms and the new decision to scalarize. collectLoopUniforms is very conservative about what instructions remain uniform after vectorization. If a memory access has the possibility of being scalarized (even though it may not be), it's pointer operand is not marked uniform. What's happening here is that you've introduced a new scalarization decision based on the cost model that collectLoopUniforms doesn't know about (and likely can't know about). In the test case, collectLoopUniforms marks the pointer operands of the loads uniform even though the loads are now scalarized, which causes the pointer operands to be non-uniform.

I'm not really sure what the best way to fix this would be. Any thoughts? One way would be to remove scalarization from your widening decision.

I'm trying to revisit collections of Uniforms and Scalars after cost estimation and remove GEPs and induction variables if we decided to scalarize. Not finished yet..

../lib/Transforms/Vectorize/LoopVectorize.cpp
1702 ↗(On Diff #83607)

CM_DECISION_NONE,
CM_DECISION_WIDEN .. ?

1725 ↗(On Diff #83607)

I agree, but iteration inside Factor is not a big overhead and it used in one more place, at least. This patch is complex enough, I'd postpone unrelated changes to another patch.

2891 ↗(On Diff #83607)

The form I used just prevents a redundant call to memoryInstructionMustBeScalarized(). Because we may have a decision (INTERLEAVE, WIDEN or GATHER_SCATTER) at this stage.

7168 ↗(On Diff #83607)

LV_NONE means "not calculated yet". If we are here, each memory instruction will have " a decision".

../test/Analysis/CostModel/X86/interleave-load-i32.ll
13 ↗(On Diff #83607)

There is no group for VF 1. Instruction cost is still there.

mssimpso edited edge metadata.Jan 17 2017, 11:42 AM

I'm trying to revisit collections of Uniforms and Scalars after cost estimation and remove GEPs and induction variables if we decided to scalarize. Not finished yet..

That sounds like a good idea to me, and I hope we can separate the uniform/scalar collection from the cost estimation in a way that makes sense. When I looked at doing that a while back, the complication I ran into was that the cost estimates depend on knowing the uniforms. But if we use the cost estimates to help determine the uniforms, then we end up with a weird circular dependence. If we mark something uniform/scalar after computing the costs, we will need to update the costs somehow. That in turn might change what we would like to scalarize, etc.

delena updated this revision to Diff 84811.Jan 18 2017, 2:40 AM

I'm revisiting the set of Scalars and Uniforms after cost modeling. Taking into account that the cost model uses Uniforms and the Uniforms are changed after the cost modeling, I still think that there is no circular dependency here.
(1) I remove uniform GEP (and the corresponding induction) when we decide to scalarize memory instruction.
(2) I do not remove GEP from scalars if it will not be used in Gather/Scatter.

I added a test that Matthew sent me.

mssimpso added inline comments.Jan 18 2017, 9:24 AM
../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009 ↗(On Diff #84811)

Hi Elena,

I had been thinking about the use of isUniformAfterVectorization() here in getInstructionCost(). Wouldn't it now be possible for the set of uniforms to differ from the first collection (before VF selection) and the second collection (after VF selection)? So we would choose a VF based on costs assuming an instruction may or may not be uniform. Then we could later reverse our initial decision about the instruction's uniformity after VF selection, making the total cost on which we based our VF decision inaccurate. Or am I missing something? I haven't yet thought through the implications of this in enough detail to know whether this would matter much or not.

delena added inline comments.Jan 19 2017, 4:40 AM
../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009 ↗(On Diff #84811)

About the list of Uniforms. We insert and then remove only GEPs and Induction variables. We do not calculate cost for them anyway. All other Uniform values stay in place. So, the cost is accurate at the end. There is no circular dependency here.

mssimpso added inline comments.Jan 19 2017, 5:18 AM
../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009 ↗(On Diff #84811)

I don't think this is true in general. We mark an instruction uniform if all its users are uniform. So for example, if we have a uniform GEP whose index is some computation, that computation is also uniform if it's only used by the GEP. I think we have some examples in induction.ll, but something like this:

%i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
%sum = add i64 %i, %x
%idx = getelementptr inbounds float, float* %a, i64 %sum
load float, float* %idx, align 4

The GEP is consecutive, so it will be marked uniform. %sum will aslo be marked uniform because it's only used by the GEP. If we later decide to scalarize the load, the GEP, the IV, and %sum will all no longer be uniform. So the cost for %sum will have been wrong.

mssimpso added inline comments.Jan 19 2017, 6:19 AM
../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009 ↗(On Diff #84811)

Just a thought - why not recompute and cache the uniforms (and possibly scalars) for each VF we compute costs for? That would avoid any potential logical inconsistencies. I think the compile-time overhead would probably be minimal (and you're already computing these sets twice anyway).

delena added inline comments.Jan 19 2017, 7:18 AM
../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009 ↗(On Diff #84811)

Just talked with Ayal about this.

I can collect Uniforms after making decision about Load/Store intructions. And the decision is based on cost. The decision affects another instructions inside the loop, as you've pointed before. Theoretically, if I have N variants of representing all memory instructions inside the loop, I should examine 2**N combinations per VF.

Ayal proposed the following sequence, which should be done on CM stage, after legality is finished:
Per VF:

  1. Go through all memory insts and make CM decision
  2. Build Uniforms and Scalars per VF (that's what you say now)
  3. Calculate cost for VF, based on Uniforms and Scalars

It is still not ideal, but, probably better than what we have.

mssimpso added inline comments.Jan 19 2017, 7:44 AM
../lib/Transforms/Vectorize/LoopVectorize.cpp
7008–7009 ↗(On Diff #84811)

Ayal's sequence makes sense to me. We should probably also try to move Uniforms/Scalars and related functions over to LoopVectorizationCostModel rather than keep them in LoopVectorizationLegality, as they will now be a function of Cost/VF and would be more appropriate there in my opinion. This could probably be a separate patch.

delena updated this revision to Diff 85764.Jan 25 2017, 9:04 AM

I moved Uniforms and Scalars from Legality to the Cost Model. Now we collect Uniforms and Scalars per VF and these collections depend on widening decisions that CM takes for Load/Store instructions.

The patch is big, but I did not see how to split it into separate patches.

delena added a subscriber: dorit.Jan 26 2017, 12:21 AM

Ping *
We are in sync with Ayal and Gil working on VPlan.

Hi Elena,

I'll take a look at this again today. Thanks for the reminder!

Hi Elena,

Thanks for your patience. I haven't yet looked in detail at the widening decision selection, but here are some comments around the uniforms/scalars.

Matt.

../lib/Transforms/Vectorize/LoopVectorize.cpp
5616–5619 ↗(On Diff #85764)

This is not true now, right? An interleaved access may be scalarized based on the cost model.

5625–5626 ↗(On Diff #85764)

Can we change this to something like:

if (Uniforms.count(VF))
  return;

auto &UniformsVF = Uniforms[VF];

We want to distinguish the case that (1) Uniforms have not been computed for VF from (2) Uniforms have been computed for VF but there aren't any, so we don't need to compute them again. We can end up calling this twice for the same VF if we have a user-selected VF and then compute the expected cost for interleaving. This is similar to the way we do this check in collectInstsToScalarize.

This will also apply to collectLoopScalars.

5687 ↗(On Diff #85764)

Why not roll this check into memoryInstructionMustBeScalarized? Either way, I think this check and the check below in isVectorizedMemAccessUse that calls memoryInstructionMustBeScalarized, should be the same.

7007 ↗(On Diff #85764)

The VF > 1 check is not needed because you check that condition in isUniformAfterVectorization.

7022 ↗(On Diff #85764)

This should be VF == 1, since we can't have a zero VF.

7466–7479 ↗(On Diff #85764)

Can we just delete this in favor of a helper function that checks VecValuesToIgnore and IsScalarAfterVectorization for a given VF? Something like:

bool LoopVectorizationCostModel::shouldIgnoreVecValueInCostModel(Instruction *I, unsigned VF) {
  return VecValuesToIgnore.count(I) || isScalarAfterVectorization(I, VF);
}

This way we won't have to be imprecise.

delena marked 3 inline comments as done.Jan 31 2017, 7:22 AM
delena added inline comments.
../lib/Transforms/Vectorize/LoopVectorize.cpp
5616–5619 ↗(On Diff #85764)

It is still true. Instruction must be scalarized if there is no any other option. This is "legality" check, cost model comes later.

5625–5626 ↗(On Diff #85764)

Yes. I've changed.

5687 ↗(On Diff #85764)

Yes, I don't need to check legality any more. It is already implied in CM decision.

7007 ↗(On Diff #85764)

Yes. You are right, thanks!

7466–7479 ↗(On Diff #85764)

Yes, it is possible.

delena updated this revision to Diff 86431.Jan 31 2017, 7:31 AM
delena marked 2 inline comments as done.

Updated, following Matthew's comments.

Hi Elena,

Here are some more inline comments. Also, please clang-format the patch if you haven't already done so to make the review easier. Thanks!

../lib/Transforms/Vectorize/LoopVectorize.cpp
1930 ↗(On Diff #86431)

How are we using CM_DECISION_NONE? Aren't we forced to make some sort of decision? I would think the default would be CM_DECISION_WIDEN unless the cost model points to one of the others.

2054–2055 ↗(On Diff #86431)

The VF > 2 comment can be removed now.

2062–2063 ↗(On Diff #86431)

The VF > 2 comment can be removed now.

5503–5504 ↗(On Diff #86431)

Since you've added a new check in collectUnifomsAndScalars to ensure the analysis is performed only once, the check here and in collectLoopUniforms is redundant. Should these be asserts now?

5556–5563 ↗(On Diff #86431)

I don't think a see a real use for this function anymore. Please see my related comment about memoryAccessMustBeScalarized. This function was only ever used in collectLoopUniforms to help determine how a memory access would be vectorized. I think you can probably greatly simplify the logic in collectLoopUniforms and remove it. Now, we know what the vectorizer will do based the the cost model decision.

For a GEP to remain uniform, I think we just need to know that all its users are CM_DECISION_INTERLEAVE or CM_DECISION_WIDEN. Is this right? If so, please work it into the GEP part of collectLoopUniforms.

5616–5619 ↗(On Diff #85764)

I think the cost vs legality distinction is not important here. My original intent with this function was just to consolidate all the scalarization conditions for a given access. That way it could be called when collecting uniforms, computing costs, and vectorizing and they all would agree on what would happen. Perhaps the choice of name was misleading?

In any case, unless I'm missing something, I don't think you actually use this function anymore. You've replaced all uses with

getWideningDecision(I, VF) == CM_DECISION_SCALARIZE

This makes sense because I think you've moved all the non-cost related scalarization decisions into the inverse: memoryInstructionCanBeWidened. Can this function be deleted now?

delena marked 2 inline comments as done.Feb 1 2017, 7:42 AM

Hi Elena,

Here are some more inline comments. Also, please clang-format the patch if you haven't already done so to make the review easier. Thanks!

Done.

../lib/Transforms/Vectorize/LoopVectorize.cpp
1930 ↗(On Diff #86431)

I used it while capturing decisions for an interleave group. At this point I can get rid of it. I return "NONE" if no decision, and it is convenient.

InstWidening getWideningDecision(Instruction *I, unsigned VF) {
    assert(VF >= 2 && "Expected VF >=2");
    std::pair<Instruction *, unsigned> InstOnVF = std::make_pair(I, VF);
    if (!WideningDecisions.count(InstOnVF))
      return CM_DECISION_NONE;
    return WideningDecisions[InstOnVF];
  }

Then I use it in assertion, to make sure that decision is made.

5503–5504 ↗(On Diff #86431)

May be. Until somebody will decide to call these functions separately. Right now I don't see a reason. I'll put an "assert" and add a comment.

5556–5563 ↗(On Diff #86431)

I removed mustBeScalarized(). (I suppose you meant this function, the lines are mixed up)

delena updated this revision to Diff 86636.Feb 1 2017, 7:44 AM
mssimpso added inline comments.Feb 1 2017, 7:54 AM
../lib/Transforms/Vectorize/LoopVectorize.cpp
5556–5563 ↗(On Diff #86431)

In this comment I was talking about hasConsecutiveLikePtrOperand. I don't think it's needed anymore and can probably be deleted. It's only used in collectLoopUniforms to help determine if a GEP will remain uniform. But can't we now determine that much easier by just checking the CM_DECISION of the accesses? Does that make sense?

delena updated this revision to Diff 86780.Feb 2 2017, 2:05 AM

I removed hasConsecutiveLikePtrOperand and simplified the code.

Hi Elena,

I like the direction this patch is going. Thanks for all your work. Here are some more comments inline.

../lib/Transforms/Vectorize/LoopVectorize.cpp
7053 ↗(On Diff #86780)

Can we avoid the divisions here and below and use multiplication instead? We won't have any round-off issues that way. What about something like:

unsigned Accesses = 1;
if (Legal->isAccessInterleaved(&I)) {
  Accesses = Group->getNumMembers();
  ...
}
...
ScalarizationCost *= Accesses;
...
7072–7080 ↗(On Diff #86780)

I think we're fairly consistent in other places where we compare costs, that we prefer the scalar version if there's no benefit for vectorization. So if the scalarization cost is <= the other costs, I think we should go with that.

7255–7284 ↗(On Diff #86780)

Can't this all be replaced by a call to getInterleaveGroupCost()?

7291–7314 ↗(On Diff #86780)

Can't this all be replaced by a call to getMemInstScalarizationCost()?

7323–7327 ↗(On Diff #86780)

Similar comment to the above. I don't think you have a helper for computing gather/scatter cost, but I think it would be nice. It would be easier to keep getInstructionCost in sync with setCostBasedWideningDecision.

delena marked an inline comment as done.Feb 6 2017, 4:44 AM
delena added inline comments.
../lib/Transforms/Vectorize/LoopVectorize.cpp
7072–7080 ↗(On Diff #86780)

Agree. I've changed the comparison.

7255–7284 ↗(On Diff #86780)

I reached the conclusion that we don't need to calculate the cost again. We can keep it together with widening decision.

delena updated this revision to Diff 87215.Feb 6 2017, 4:59 AM

The memory instruction cost that was calculated during widening decision is saved in WideningDecisions map in order to avoid recalculation.

Hi Elena,

I have one comment about getWideningCost; otherwise, the patch looks fine to me now. But please let Michael have one more pass at review since he's been quiet for a while. Thanks!

../lib/Transforms/Vectorize/LoopVectorize.cpp
1995–1996 ↗(On Diff #87215)

This seems weird to me. All instructions are supposed to have a cost computed by the cost model. I would much rather us assert that I is in WideningDecisions. Isn't it true that if I is a load or store, we would have already computed a cost and saved it in WideningDecisions?

7099 ↗(On Diff #87215)

Just assert inside getWideningCost() that "I" is present in the mapping and return the cost. The int to unsigned max conversion is unnecessary.

delena updated this revision to Diff 87281.Feb 6 2017, 12:16 PM

Updated according to the resent Matthew's comments.
Matthew, thanks a lot for your review.

Michael, could you, please, take a look?

mkuper added a comment.Feb 6 2017, 5:17 PM

The high-level structure of this looks good to me, thanks Elena!

Some minor/style comments inline.

../lib/Transforms/Vectorize/LoopVectorize.cpp
330 ↗(On Diff #87281)

We probably already have several clones of those functions around the code-base. And they are probably all slightly different.
LSR has getAccessType(), LAA has getAddressSpaceOperand(), LoadStoreVectorizer has getPointerAddressSpace(), and I'm sure there are more.
I don't want to make merging them a precondition of this patch, but can you please at least add a FIXME here?

1929 ↗(On Diff #87281)

Why not just "return Uniforms[VF]->second.count(I);"? I don't think the verbosity helps here, and we don't actually care about the difference between find() and operator[] due to the assert.
Or, to save a lookup in an asserts build, you could find() and then assert on the result of the find().

1937 ↗(On Diff #87281)

Same as above.

1988 ↗(On Diff #87281)

Here, on the other hand, I think find() would be better - that way you don't need two lookups.

2120 ↗(On Diff #87281)

I'm still not sure I understand why this gets called twice for a user-provided VF. Could you explain again?

5512 ↗(On Diff #87281)

Do we still want this check here?
I mean:

  1. An instruction with a uniform pointer *can* be widended.
  2. That ties in with the way this is used - we check for the uniform case before the widening case, as we should. So, IIUC, we should never actually hit this.
7022 ↗(On Diff #87281)

UINT_MAX

7036 ↗(On Diff #87281)

Why the NumAccesses * 2 cut-off?

7047 ↗(On Diff #87281)

UINT_MAX

7060 ↗(On Diff #87281)

Why do you need the "GatherScatterCost < InterleaveCost" check here?

1930 ↗(On Diff #86431)

Maybe rename NONE to UNKNOWN, then? But I'm fine with None if you think that's clearer.
Also, I looked up the naming convention for enums in the coding standard, and I think it should be something like "CM_Widen, CM_Interleave", etc, not all caps.

delena marked 5 inline comments as done.Feb 7 2017, 12:48 AM
delena added inline comments.
../lib/Transforms/Vectorize/LoopVectorize.cpp
1929 ↗(On Diff #87281)

The "const" qualifier does not allow the Uniforms[VF] form.

2120 ↗(On Diff #87281)

We call this function from multiple places. From calculateRegisterUsage(), selectVectorizationFactor() - user-defined-VF and from expectedCost().
I just prevent recalculation.

7036 ↗(On Diff #87281)

I consider a cost per instruction. In this case InterleaveCost / NumAccesses == 1. (Matthew asked to avoid divisions).
About 1 inst per access is good enough. I added more comments.

delena updated this revision to Diff 87371.Feb 7 2017, 1:03 AM

Some code improvements, addressed Michael's comments.

mkuper added inline comments.Feb 7 2017, 9:27 AM
../lib/Transforms/Vectorize/LoopVectorize.cpp
7069 ↗(On Diff #87371)

So, in case the costs are equal, you prefer scalarization to interleaving, and interleaving to scatter/gather?
(In theory it shouldn't matter what happens when the costs are equal, just making sure this I understand.)

1929 ↗(On Diff #87281)

Ohh, right, missed it's const, sorry.

2120 ↗(On Diff #87281)

I understand, I'm just wondering whether we really need to do that.
Anyway, it's not a new problem, we don't have to solve it here.

5512 ↗(On Diff #87281)

I think you may have missed this comment.

7036 ↗(On Diff #87281)

Well, this isn't really "about 1", this is "below 2".
I'd be more conservative here (InerelaveCost <= NumAccesses ? Can this even happen? Or are you trying to catch the cases where the ratio is ~1.1-1.2?). Or maybe remove this altogether. Is getGatherScatterCost() expensive in terms of compile time?

mssimpso added inline comments.Feb 7 2017, 9:36 AM
../lib/Transforms/Vectorize/LoopVectorize.cpp
7036 ↗(On Diff #87281)

The TTI estimates are supposed to be cheap to compute. I think it makes sense to remove this altogether in favor of greater simplicity.

delena added inline comments.Feb 8 2017, 1:02 AM
../lib/Transforms/Vectorize/LoopVectorize.cpp
7069 ↗(On Diff #87371)

Matthew asked to change:

"I think we're fairly consistent in other places where we compare costs, that we prefer the scalar version if there's no benefit for vectorization. So if the scalarization cost is <= the other costs, I think we should go with that."

2120 ↗(On Diff #87281)

In the current version, before my changes, we calculate Uniforms and Scalars once and do this in Legality. In this patch, I moved Uniforms and Scalars from Legality to the Cost Model and calculate them per VF. I did not find a single place to put the call, I'm calling the collectUniformsAndScalars() from multiple places. Doing that, I want to prevent the data recalculation.

I suppose that finding a right single place for calling collectUniformsAndScalars() is possible, but it will require additional movements in selectVectorizationFactor(). I think it can be done in a separate patch.

5512 ↗(On Diff #87281)

I'll fix.

7036 ↗(On Diff #87281)

ok.

delena updated this revision to Diff 87604.Feb 8 2017, 1:12 AM

Some fixes following Michael's recent comments.

mkuper accepted this revision.Feb 8 2017, 10:21 AM

LGTM

../lib/Transforms/Vectorize/LoopVectorize.cpp
7069 ↗(On Diff #87371)

Ah, ok, missed that. Could you please add this as an explicit comment?

2120 ↗(On Diff #87281)

Ah, ok, got it. Sure, sounds good as a follow-up.

This revision is now accepted and ready to land.Feb 8 2017, 10:21 AM
This revision was automatically updated to reflect the committed changes.