This is an archive of the discontinued LLVM Phabricator instance.

[LV][LAA] Vectorize loop invariant values stored into loop invariant address
ClosedPublic

Authored by anna on Aug 13 2018, 2:29 PM.

Details

Summary

We are overly conservative in the loop vectorizer with respect to stores to loop-invariant
addresses.
More details in https://bugs.llvm.org/show_bug.cgi?id=38546
This is the first part of the fix, where we start by vectorizing stores of loop-invariant
values to loop-invariant addresses.
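For illustration (a hand-written sketch with made-up names, not the exact reproducer from the PR), the basic pattern this first part targets looks like:

  // Both the stored value and the address are loop-invariant, yet the store
  // currently makes the loop vectorizer give up on the whole loop. LICM cannot
  // simply sink the store either when the loop is not known to execute (n may be 0).
  void invariant_store(int *a, int *b, int x, int n) {
    for (int i = 0; i < n; ++i) {
      *a = x;          // invariant value stored to an invariant address
      b[i] = b[i] + x; // the rest of the loop is otherwise vectorizable
    }
  }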

Diff Detail

Repository
rL LLVM

Event Timeline

anna created this revision.Aug 13 2018, 2:29 PM
Ayal added a comment.Aug 14 2018, 1:53 PM

The decision how to vectorize invariant stores also deserves attention: LoopVectorizationCostModel::setCostBasedWideningDecision() considers loads from uniform addresses, but not invariant stores - these may end up being scalarized or becoming a scatter; the former is preferred in this case, as the identical scalarized replicas can later be removed. In any case associated cost estimates should be provided to support overall vectorization costs. Note that vectorizing conditional invariant stores deserves special attention. Unconditional invariant stores are candidates to be sunk out of the loop, preferably before trying to vectorize it. One approach to vectorize a conditional invariant store is to check if its mask is all false, and if not to perform a single invariant scalar store, for lack of a masked-scalar-store instruction. May be worth distinguishing between uniform and divergent conditions; this check is easier to carry out in the former case.

include/llvm/Analysis/LoopAccessAnalysis.h
570 ↗(On Diff #160448)

This becomes dead?

578 ↗(On Diff #160448)

AND both indicators?

anna added a comment.Aug 15 2018, 5:16 AM

Hi Ayal, thanks for the comments!

The decision how to vectorize invariant stores also deserves attention: LoopVectorizationCostModel::setCostBasedWideningDecision() considers loads from uniform addresses, but not invariant stores - these may end up being scalarized or becoming a scatter; the former is preferred in this case, as the identical scalarized replicas can later be removed.

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

In any case associated cost estimates should be provided to support overall vectorization costs.

agreed.

Note that vectorizing conditional invariant stores deserves special attention. Unconditional invariant stores are candidates to be sunk out of the loop, preferably before trying to vectorize it.

If we get unconditional invariant stores which haven't been sunk out of the loop and it has reached vectorizer, I think we should let the loop vectorizer vectorize it. Irrespective of what other passes such as LICM should have done with store promotion/sinking. See example in https://bugs.llvm.org/show_bug.cgi?id=38546#c1. Even running through clang++ O3 doesn't sink the invariant store out of loop and that store prevents the vectorization of entire loop.

One approach to vectorize a conditional invariant store is to check if its mask is all false, and if not to perform a single invariant scalar store, for lack of a masked-scalar-store instruction. May be worth distinguishing between uniform and divergent conditions; this check is easier to carry out in the former case.

Thanks, I thought these were automatically handled. Will address in updated patch.

include/llvm/Analysis/LoopAccessAnalysis.h
570 ↗(On Diff #160448)

The idea is to retain the identification of storeToLoopInvariantAddress if other passes which use LAA need it.
That's the reason I separated out the StoreToLoopInvariantAddress and NonVectorizableStoreToLoopInvariantAddress.

578 ↗(On Diff #160448)

uh oh. was an older change. will fix.

anna updated this revision to Diff 160882.Aug 15 2018, 12:00 PM
anna marked 2 inline comments as done.

added cost model changes for unpredicated invariant stores. The predicated invariant stores will
generate extra stores here, and the cost model also (already) considers the cost of predicated stores.
Since the cost model correctly reflects the cost of the (badly) generated predicated stores,
I've added a couple of tests to show that invariant predicated stores are handled correctly, with TODOs
for a follow-on patch for better code gen.

anna added inline comments.Aug 15 2018, 12:02 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
5869 ↗(On Diff #160882)

Predicated uniform stores will fall under this cost model. The next patch will be to address the improved code gen for this case and update the cost model for predicated uniform stores.

anna updated this revision to Diff 161525.Aug 20 2018, 11:56 AM

Teach LAA about non-predicated uniform stores. Added test cases to make sure they are not
treated as predicated stores.

Ayal added a comment.Aug 20 2018, 4:06 PM

...

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

  • - by revisiting LoopVectorizationCostModel::collectLoopUniforms()? ;-)
include/llvm/Analysis/LoopAccessAnalysis.h
638 ↗(On Diff #161525)

Better name it more accurately as, e.g., VariantStoreToLoopInvariantAddress?

570 ↗(On Diff #160448)

OK. But LoopVectorizationLegality below seems to be its only user.

lib/Analysis/LoopAccessAnalysis.cpp
1865 ↗(On Diff #161525)

isLoopInvariantStoreValue ?

1871 ↗(On Diff #161525)

Again, something LICM may have missed?

lib/Transforms/Vectorize/LoopVectorize.cpp
5754 ↗(On Diff #161525)

Can use if (auto *LI = dyn_cast<LoadInst>(I)) {

5763 ↗(On Diff #161525)

Indent

5880 ↗(On Diff #161525)

On certain targets, e.g., skx, an invariant store may end up as a scatter, so setting this decision here to avoid that is important; potentially worthy of a note / a test.

anna marked 4 inline comments as done.Aug 21 2018, 10:03 AM
anna added inline comments.Aug 21 2018, 10:03 AM
include/llvm/Analysis/LoopAccessAnalysis.h
570 ↗(On Diff #160448)

yes, that's right. I made the change, but the analysis has an ORE and there are 5 tests in the LoopAccessAnalysis that are failing because the ORE check "Store to invariant address was [not] found in loop" is missing. See test/Analysis/LoopAccessAnalysis/store-to-invariant-check1.ll where it looks for the presence of "Store to invariant address was found in loop".

I'll remove the code as a follow on clean up and if there's a need for this by other passes that use LAA, folks can add it back when required.

I think it also makes sense to add an ORE for the "VariantStoreToInvariantAddress" as part of this current change.

638 ↗(On Diff #161525)

Okay, I'll change the name.

JFI - Today the changed name (VariantStoreToLoopInvariantAddress) is accurate. However, my plan is to eventually teach the vectorizer about all *safe* uniform stores, not just invariant values stored to an invariant address.

So VariantStoreToLoopInvariantAddress can also be vectorized under certain conditions (a safe dependence distance calculation for the store versus the other memory accesses). So something like the example below can be vectorized [1]:

  for (i=0; i<n;i++)
      for (j=0; j<n; j++) {
             p[i] = b[j];
             z[j] += b[j];
      }

However, this cannot be vectorized safely:

  for (i=0; i<n;i++)
      for (j=0; j<n; j++) {
             z[j] = (++p[i]);  <-- dependence distance for uniform store and load is 1.
      }

[1] LICM should try to sink the store out of the inner loop, but sometimes it cannot do so because it cannot prove dereferenceability for the store address or that the store is guaranteed to execute at least once.

lib/Analysis/LoopAccessAnalysis.cpp
1871 ↗(On Diff #161525)

yes, LICM misses this as well - see added test case in inv_val_store_to_inv_address_conditional_inv.

anna updated this revision to Diff 161787.Aug 21 2018, 11:50 AM
anna marked an inline comment as done.

Addressed review comments, updated ORE message and tests, fixed an assertion failure in cost model calculation for uniform store (bug uncovered when running test
under X86 skylake)

anna marked an inline comment as done.Aug 21 2018, 11:53 AM
anna added inline comments.
include/llvm/Analysis/LoopAccessAnalysis.h
570 ↗(On Diff #160448)

I've made both the changes in this patch since changing the ORE is clearer in one patch.

lib/Transforms/Vectorize/LoopVectorize.cpp
5880 ↗(On Diff #161525)

thanks for bringing this up. It exercised X86TTIImpl::getMemoryOpCost, which exposed the bug in my previous diff in LoopVectorizationCostModel::getUniformMemOpCost for uniform stores: I was passing in the store's type instead of the stored value's type.
I've also updated it to use the "unified" interface for load/store just like the other cost model calculations - getGatherScatterCost etc.
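For readers following along, the shape of the fix was roughly the following (a hedged C++ sketch; the exact TTI interface varies across LLVM versions, so treat the signature as approximate):

  #include "llvm/Analysis/TargetTransformInfo.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // A StoreInst itself has void type, so the memory-op cost must be queried
  // with the type of the value being stored.
  static int uniformStoreCost(const TargetTransformInfo &TTI, StoreInst *SI) {
    Type *ValTy = SI->getValueOperand()->getType(); // e.g. i32, not void
    return TTI.getMemoryOpCost(Instruction::Store, ValTy, SI->getAlignment(),
                               SI->getPointerAddressSpace());
  }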

anna marked an inline comment as done.Aug 21 2018, 11:56 AM

...

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

  • - by revisiting LoopVectorizationCostModel::collectLoopUniforms()? ;-)

Right now, I just run instcombine after loop vectorization to clean up those unnecessary stores (and test cases make sure there's only one store left). Looks like there are other places in LV that rely on InstCombine as the cleanup pass, so it may not be that bad after all? Thoughts?

hsaito added a subscriber: hsaito.Aug 21 2018, 1:35 PM

...

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

  • - by revisiting LoopVectorizationCostModel::collectLoopUniforms()? ;-)

Right now, I just run instcombine after loop vectorization to clean up those unnecessary stores (and test cases make sure there's only one store left). Looks like there are other places in LV that rely on InstCombine as the cleanup pass, so it may not be that bad after all? Thoughts?

Ideally, each optimizer should generate as clean an output IR as it feasibly can. Cleaning up this particular "mess" is one of the simpler tasks LV can do on its own.

Ayal added a comment.Aug 22 2018, 2:24 PM

...

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

  • - by revisiting LoopVectorizationCostModel::collectLoopUniforms()? ;-)

Right now, I just run instcombine after loop vectorization to clean up those unnecessary stores (and test cases make sure there's only one store left). Looks like there are other places in LV that rely on InstCombine as the cleanup pass, so it may not be that bad after all? Thoughts?

Yeah, this is a bit embarrassing, but currently invariant loads also get replicated (and cleaned up later), despite trying to avoid doing so by recording IsUniform in VPReplicateRecipe. In general, if it's simpler and more consistent to generate code in a common template and potentially cleanup later, should be ok provided the cost model accounts for it accurately and cleanup is guaranteed, as checked by tests. BTW, LV already has an internal cse(). But in this case, VPlan should reflect the final outcome better, i.e., with a correct IsUniform. This should be taken care of, possibly by a separate patch.

include/llvm/Analysis/LoopAccessAnalysis.h
638 ↗(On Diff #161525)

Yes, in/variant stores to an invariant address may carry cross-iteration dependencies with other loads/stores, which could potentially be checked at runtime similar to 'regular' stores. LV supports reductions/inductions if carried by temporaries only, rather than via memory. Such cases should indeed be LICM'd before vectorization - sinking unconditional stores down to a dominated "middle" block, where it's dereferenceable and known to have executed at least once.

568 ↗(On Diff #161787)

Update above comment as well: "non-vectorizable stores" >> "of variant values"

lib/Analysis/LoopAccessAnalysis.cpp
1871 ↗(On Diff #161525)

Ahh, but a phi of invariant values is invariant iff the compares that decide which predecessor will reach the phi, are also invariant. In inv_val_store_to_inv_address_conditional_inv this holds because there %cmp determines which predecessor it'll be, and %cmp is invariant. In general Divergence Analysis is designed to provide this answer, as in D50433's isUniform().
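A source-level illustration with made-up names may help: both incoming values of the phi are invariant, but the selected value is only effectively invariant when the controlling compare is invariant as well.

  void phi_of_invariant_values(int *a, const int *b, int k, int ntrunc, int n) {
    bool inv_cmp = (ntrunc == k);            // loop-invariant condition
    for (int i = 0; i < n; ++i) {
      int storeval = inv_cmp ? ntrunc : k;   // invariant: same value every iteration
      // int storeval = (b[i] == k) ? ntrunc : k; // variant: may change per iteration
      *a = storeval;
    }
  }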

lib/Transforms/Vectorize/LoopVectorize.cpp
5880 ↗(On Diff #161525)

very good

5766 ↗(On Diff #161787)

Should be consistent and use the same isLoopInvariantStoreValue() noted above.

5870 ↗(On Diff #161787)

We expect here that isa<LoadInst>(&I) || isa<StoreInst>(&I) (as memoryInstructionCanBeWidened() will assert below) having checked getLoadStorePointerOperand(&I) above.

anna marked 3 inline comments as done.Aug 23 2018, 8:36 AM
anna added inline comments.
lib/Analysis/LoopAccessAnalysis.cpp
1871 ↗(On Diff #161525)

yes, that's right. Note that this patch handles phis of invariant values based on either an invariant condition or a variant condition (see inv_val_store_to_inv_address_conditional_diff_values_ic where the phi result is based on a varying condition).

The improved codegen and cost model handling is for predicated stores, where the block containing the invariant store is to be predicated. Today, we just handle this as a "predicated store" cost and generate code accordingly.

anna added a comment.Aug 23 2018, 8:37 AM

...

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

  • - by revisiting LoopVectorizationCostModel::collectLoopUniforms()? ;-)

Right now, I just run instcombine after loop vectorization to clean up those unnecessary stores (and test cases make sure there's only one store left). Looks like there are other places in LV that rely on InstCombine as the cleanup pass, so it may not be that bad after all? Thoughts?

Yeah, this is a bit embarrassing, but currently invariant loads also get replicated (and cleaned up later), despite trying to avoid doing so by recording IsUniform in VPReplicateRecipe. In general, if it's simpler and more consistent to generate code in a common template and potentially cleanup later, should be ok provided the cost model accounts for it accurately and cleanup is guaranteed, as checked by tests. BTW, LV already has an internal cse(). But in this case, VPlan should reflect the final outcome better, i.e., with a correct IsUniform. This should be taken care of, possibly by a separate patch.

I see. thanks for the clarification. So, for now, I'll leave the stores in the IR just like we're doing for the loads and add a "TODO" for both.

anna updated this revision to Diff 162206.Aug 23 2018, 9:39 AM

address review comments (NFC wrt previous diff). Added one test for varying value stored into invariant address.

anna updated this revision to Diff 162212.Aug 23 2018, 9:59 AM

Added TODOs for better code gen of predicated uniform store and removing redundant loads and stores left behind during
scalarization of these uniform loads and stores.

Ayal added inline comments.Aug 23 2018, 12:08 PM
lib/Analysis/LoopAccessAnalysis.cpp
1871 ↗(On Diff #161525)

So the suggested isLoopInvariantStoreValue name is incorrect, as the store value may be variant. What's special about these variant values - why not handle any store value?

Yes, conditional stores to an invariant address will end up scalarized and predicated, i.e., with a branch-and-store per lane, which is quite inefficient. A masked scatter may work better there, until optimized by a single branch-and-store if any lane is masked-on (invariant stored value) or single branch-and-store of last masked-on lane (in/variant stored value).

lib/Transforms/Vectorize/LoopVectorize.cpp
5766 ↗(On Diff #161787)

This is still inconsistent with the OR-operands-are-invariant above.

anna added inline comments.Aug 23 2018, 12:32 PM
lib/Analysis/LoopAccessAnalysis.cpp
1871 ↗(On Diff #161525)

I am currently adding the support for *any variant* stores to invariant address. The unsafe cross iteration dependencies are identified through LAA: unsafe dependent memory operations in loop. This was identified without any changes required from my side to the LAA memory conflict detection. However, I'm not sure if LAA handles all cases exhaustively.

The reason I started with this sub-patch is that when the stored value is not a varying memory access from within the loop (that's what this blob of code is really trying to do - see isLoopInvariant and hasLoopInvariantOperands), we don't need to reason about whether LAA handles all memory conflict detection.

When the stored value can be any variant value, we need to make sure that the LAA pass handles all the memory conflicts correctly or update LAA if that isn't the case.

lib/Transforms/Vectorize/LoopVectorize.cpp
5766 ↗(On Diff #161787)

will update.

anna added inline comments.Aug 23 2018, 1:27 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
5766 ↗(On Diff #161787)

actually, this is correct. We don't need to update it to the "incorrectly named" lambda above.

We need to do an extract if the value is not invariant: example case:

for.body:                                         ; preds = %latch, %entry
  %i = phi i64 [ %i.next, %latch ], [ 0, %entry ]
  %tmp1 = getelementptr inbounds i32, i32* %b, i64 %i
  %tmp2 = load i32, i32* %tmp1, align 8
  %varying_cmp = icmp eq i32 %tmp2, %k
  store i32 %ntrunc, i32* %tmp1
  br i1 %varying_cmp, label %cond_store, label %cond_store_k

cond_store:
  br label %latch

cond_store_k:
  br label %latch

latch:
  %storeval = phi i32 [ %ntrunc, %cond_store ], [ %k, %cond_store_k ]
  store i32 %storeval, i32* %a <-- uniform store

storeval's operands are invariant, but the value being chosen in each iteration of the loop varies based on %varying_cmp. In this case, we need an extract and then the scalar store. That's exactly what we do as well.
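Conceptually (hypothetical names, VF = 4), the scalarized codegen has to do something like the sketch below - not the vectorizer's literal output:

  #include <array>

  // Each lane's selected value is extracted and stored to the same invariant
  // address; doing the stores in lane order keeps the scalar semantics, since
  // the last lane's value is the one that remains.
  void scalarized_uniform_store(const std::array<bool, 4> &varying_cmp,
                                int ntrunc, int k, int *a) {
    for (unsigned lane = 0; lane < 4; ++lane) {
      int val = varying_cmp[lane] ? ntrunc : k; // per-lane select of the phi'd values
      *a = val;                                 // extract + scalar store, same address
    }
  }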

anna added a comment.Aug 24 2018, 7:00 AM

okay, to keep this patch true to the original intent and commit message: I'm going to change it to handle just the store of invariant values to invariant addresses (i.e. no support for OR-operands-are-invariant). It will admittedly be a more conservative patch. The ORE message will also correctly reflect "variant stores to invariant addresses".

Also, the more general patch which is in progress is to handle the store of any (in/variant) value into an invariant address. It requires handling a UB case: when the user incorrectly annotates a loop which has memory conflicts as parallel.loop, and the vectorizer vectorizes the loop with a store to a uniform address (but the loop has a memory conflict). There was a bug fixed a while back: https://bugs.llvm.org/show_bug.cgi?id=15794#c4
As part of this more general patch, the ORE about uniform stores will also be removed, since we don't seem to need it (or we can keep the original code around if needed).

anna added a comment.Aug 24 2018, 10:23 AM

One more interesting thing I noticed while adding predicated invariant stores for X86 (-mcpu=skylake-avx512): it supports masked scatter for non-uniform stores.
But we need to add support for uniform stores along with this patch. Today, it just generates incorrect code (no predication whatsoever).
For other architectures that do not have these masked intrinsics, we just generate the predicated store by doing an extract and branch on each lane (correct but inefficient and will be avoided unless -force-vector-width=X).

anna planned changes to this revision.Aug 24 2018, 10:26 AM

see comment above for masked scatter support.

One more interesting thing I noticed while adding predicated invariant stores for X86 (-mcpu=skylake-avx512): it supports masked scatter for non-uniform stores.
But we need to add support for uniform stores along with this patch. Today, it just generates incorrect code (no predication whatsoever).
For other architectures that do not have these masked intrinsics, we just generate the predicated store by doing an extract and branch on each lane (correct but inefficient and will be avoided unless -force-vector-width=X).

In general, self output dependence is fine to vectorize (whether the store address is uniform or random), as long as (masked) scatter (or scatter emulation) happens from lower elements to higher elements. Intel's scatter instruction is implemented in that way, and so is CG Prepare's serialization of masked scatter intrinsic. When we check for TTI based availability/cost, we need to ensure that the HW scatter support satisfies this ordering requirement since some scatter implementations may not.

anna added a comment.Aug 24 2018, 12:47 PM

One more interesting thing I noticed while adding predicated invariant stores for X86 (-mcpu=skylake-avx512): it supports masked scatter for non-uniform stores.
But we need to add support for uniform stores along with this patch. Today, it just generates incorrect code (no predication whatsoever).
For other architectures that do not have these masked intrinsics, we just generate the predicated store by doing an extract and branch on each lane (correct but inefficient and will be avoided unless -force-vector-width=X).

In general, self output dependence is fine to vectorize (whether the store address is uniform or random), as long as (masked) scatter (or scatter emulation) happens from lower elements to higher elements.

I don't think the above comment matters for uniform addresses because a uniform address is invariant. This is what the langref states for scatter intrinsic (https://llvm.org/docs/LangRef.html#id1792):

. The data stored in memory is a vector of any integer, floating-point or pointer data type. Each vector element is stored in an arbitrary memory address. Scatter with overlapping addresses is guaranteed to be ordered from least-significant to most-significant element.

The scatter address is not overlapping for the uniform address. It is the exact same address. This is the code that gets generated for uniform stores on skylake with AVX-512 support once I fixed the bug in this patch (the scatter location is the same address and the stored value is also the same, and the mask is the vector of booleans):
pseudo code:

if (b[i] == k)
  a = ntrunc; <-- uniform store based on condition above.

IR generated:

vector.ph:
  %broadcast.splatinsert5 = insertelement <16 x i32> undef, i32 %k, i32 0
  %broadcast.splat6 = shufflevector <16 x i32> %broadcast.splatinsert5, <16 x i32> undef, <16 x i32> zeroinitializer <-- vector splat of k
  %broadcast.splatinsert9 = insertelement <16 x i32*> undef, i32* %a, i32 0
  %broadcast.splat10 = shufflevector <16 x i32*> %broadcast.splatinsert9, <16 x i32*> undef, <16 x i32> zeroinitializer <-- vector splat of i32* a.

vector.body:
  %2 = getelementptr inbounds i32, i32* %b, i64 %index
  %3 = bitcast i32* %2 to <16 x i32>*
  %wide.load = load <16 x i32>, <16 x i32>* %3, align 8
  %4 = icmp eq <16 x i32> %wide.load, %broadcast.splat6
  call void @llvm.masked.scatter.v16i32.v16p0i32(<16 x i32> %broadcast.splat8, <16 x i32*> %broadcast.splat10, i32 4, <16 x i1> %4) <-- scatter storing the same element into the same address (a), depending on same condition b[i] == k

For other architectures that do not have these masked intrinsics, we just generate the predicated store by doing an extract and branch on each lane (correct but inefficient and will be avoided unless -force-vector-width=X).

In general, self output dependence is fine to vectorize (whether the store address is uniform or random), as long as (masked) scatter (or scatter emulation) happens from lower elements to higher elements.

I don't think the above comment matters for uniform addresses because a uniform address is invariant.

Only if you are storing a uniform value.

This is what the langref states for scatter intrinsic (https://llvm.org/docs/LangRef.html#id1792):

. The data stored in memory is a vector of any integer, floating-point or pointer data type. Each vector element is stored in an arbitrary memory address. Scatter with overlapping addresses is guaranteed to be ordered from least-significant to most-significant element.

Thanks for reminding me that the intrinsic is defined with the ordering requirement.

We should also consider doing this, depending on the cost of branch versus masked scatter. For the targets w/o masked scatter, this should be better than masked scatter emulation.

%5 = bitcast <16 x i1> %4 to i16
%6 = icmp eq i16 %5, 0
br i1 %6, label %skip, label %fall
fall:
  store i32 %ntrunc, i32* %a
  br label %skip
skip:

anna added a comment.Aug 24 2018, 1:45 PM

We should also consider doing this, depending on the cost of branch versus masked scatter. For the targets w/o masked scatter, this should be better than masked scatter emulation.

%5 = bitcast <16 x i1> %4 to i16
%6 = icmp eq i16 %5, 0
br i1 %6, label %skip, label %fall
fall:
  store i32 %ntrunc, i32* %a
  br label %skip
skip:

Yes, that is the improved codegen stated as a TODO in the cost model. Today, both the cost model and the code gen will identify it as a normal predicated store: a series of branches and stores. Also, we need to differentiate these two cases:

if (b[i] == k)
  a = ntrunc;

versus

if (b[i] == k)
  a = ntrunc;
else
  a = m;

The second example should be converted into a vector-select based on b[i] == k, and the last element will be extracted out of the vector select and stored into a.
However, if for some reason it is not converted into a select and is just left as two predicated stores, it is incorrect to use the same code transformation as we do for the first example. For the first example, we check whether all lanes of the condition are false, and if so we skip the store. In the second case, we always need to store a value, but that value is decided by the last lane of the condition. These are just two different forms of predicated stores.
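To make the two forms concrete, a conceptual sketch (hypothetical names, VF = 4) of the desired handling of each case:

  #include <array>

  // Case 1: "if (b[i] == k) a = ntrunc;" -- skip the store when every lane of
  // the mask is false; otherwise one scalar store of the invariant value is enough.
  void conditional_invariant_store(const std::array<bool, 4> &mask, int ntrunc, int *a) {
    bool any_active = false;
    for (bool m : mask)
      any_active |= m;
    if (any_active)
      *a = ntrunc;
  }

  // Case 2: "if (b[i] == k) a = ntrunc; else a = m;" -- the store always executes;
  // the surviving value is the one selected by the last lane, i.e. an extract of
  // the last element of the vector select.
  void selected_invariant_store(const std::array<bool, 4> &mask, int ntrunc, int m, int *a) {
    *a = mask[3] ? ntrunc : m;
  }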

Yes, that is the improved codegen stated as a TODO in the cost model.

Aha. OK. Thanks for the clarification.

Ayal added a comment.Aug 26 2018, 9:29 AM

This is what the langref states for scatter intrinsic (https://llvm.org/docs/LangRef.html#id1792):

. The data stored in memory is a vector of any integer, floating-point or pointer data type. Each vector element is stored in an arbitrary memory address. Scatter with overlapping addresses is guaranteed to be ordered from least-significant to most-significant element.

Yes, this was intentional, precisely to support vectorization of (possibly) self-overwriting stores.

...
This is the code that gets generated for uniform stores on skylake with AVX-512 support once I fixed the bug in this patch
...

LGTM.
Indeed, care must be taken to avoid using more than one masked scatter to the same invariant address; but LAA should flag such non-self cross-iteration dependencies.

include/llvm/Analysis/LoopAccessAnalysis.h
570 ↗(On Diff #162212)

Could rename VariantStoreToLoopInvariantAddress to HasVariantStoreToLoopInvariantAddress.

lib/Analysis/LoopAccessAnalysis.cpp
1871 ↗(On Diff #161525)

It is conceivable that stores of invariant values to invariant addresses can participate in a subset of unsafe scenarios, which may be easier for LAA to detect, and thus start by treating only stores of invariant values to invariant addresses. But storing a variant phi whose "dominating" compares are not all invariant, could conceptually produce arbitrary variant values and dependencies, despite having invariant values for all other operands of the phi; e.g., 0 and -1. Presumably, this case does not differ, from LAA perspective, from stores of any variant value to invariant address.

lib/Transforms/Vectorize/LoopVectorize.cpp
5766 ↗(On Diff #161787)

Agreed. Misled by the erroneous isLoopInvariantStoreValue() name.

anna updated this revision to Diff 163328.Aug 30 2018, 7:51 AM

added test for conditional uniform store for AVX512. Rebased over fix in D51313.

anna added a comment.Aug 30 2018, 7:52 AM

This patch now only vectorizes invariant values stored into invariant addresses. It also correctly handles conditionally executed stores (fixed bug for scatter code generation in AVX512).

anna updated this revision to Diff 164667.Sep 10 2018, 7:17 AM

rebased over D51313.

Ayal added a comment.Sep 12 2018, 5:25 AM

Best allow only a single store to an invariant address for now; until we're sure the last one to store is always identified correctly.

include/llvm/Analysis/LoopAccessAnalysis.h
568 ↗(On Diff #164667)
/// Checks existence of stores to invariant address inside loop.
/// If such stores exist, checks if those are stores of variant values.

can be updated and simplified into something like

/// If the loop has any store of a variant value to an invariant address, then return true, else return false.

lib/Analysis/LoopAccessAnalysis.cpp
1869 ↗(On Diff #164667)

How about isUniform(Ptr) && !isUniform(ST->getValueOperand()) ?
Relying more consistently on SCEV to determine invariance of both address and stored value. Is there a reason for treating the stored value more conservatively, checking its invariance by asking if it's outside the loop?
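A made-up illustration of the difference: a value defined inside the loop body fails a "defined outside the loop" test, yet SCEV can still prove it invariant.

  void store_of_in_loop_invariant(int *a, int k, int n) {
    for (int i = 0; i < n; ++i) {
      int t = k + 1; // computed in the loop body, yet loop-invariant
      *a = t;
    }
  }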

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
780 ↗(On Diff #164667)

update the message as well: "write of variant value to a loop invariant address ..."

lib/Transforms/Vectorize/LoopVectorize.cpp
5890 ↗(On Diff #164667)

Complementing the consistent use of isUniform rather than isLoopInvariant:
bool isLoopInvariantStoreValue = Legal->isUniform(SI->getValueOperand()); ?
, similar to the way the address is checked to be uniform before calling this method below.

6008 ↗(On Diff #164667)

Comment can be simplified to something like

// TODO: Avoid replicating loads and stores instead of
// relying on instcombine to remove them.

test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll
16 ↗(On Diff #164667)

Have one space instead of two between i32 and %ntrunc on the check-not'd store. Easier to see that this checks for a single copy of the store, i.e., that instcombine eliminated all redundant copies. May want to comment what this test is designed to check.

131 ↗(On Diff #164667)

inv_val_load_to?

test/Transforms/LoopVectorize/invariant-store-vectorization.ll
11 ↗(On Diff #164667)

"that check whether" >> "check that"
("whether" usually comes with an "or not")

80 ↗(On Diff #164667)

"as identifying these" >> "identify them"
Do we check what the cost model identifies?

132 ↗(On Diff #164667)

Hmm, multiple stores to the same invariant address did not trigger LAI memory dependence checks(?)
This may generate wrong code if the conditional scalarized stores are emitted in the wrong order, or if a pair of masked scatters are used.
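A hypothetical source example of the concern:

  // Two conditional stores to the same invariant address in one iteration.
  // If their scalarized (or scattered) copies are interleaved in the wrong
  // order across lanes, *a can end up holding the wrong value for some inputs.
  void two_stores_to_invariant_address(int *a, const int *b, const int *c, int n) {
    for (int i = 0; i < n; ++i) {
      if (b[i] == 0)
        *a = 1;
      if (c[i] == 0)
        *a = 2; // for the same i, this must still happen after the store above
    }
  }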

151 ↗(On Diff #164667)

good to continue CHECKing that EE1 is used in a branch that guards a store of %ntrunc to %a.

182 ↗(On Diff #164667)

Now the store/s is/are no longer of invariant value/s.

183 ↗(On Diff #164667)

.. once we support vectorizing stores of variant values to invariant addresses

219 ↗(On Diff #164667)

.. efficiently once divergence analysis identifies storeval as uniform

252 ↗(On Diff #164667)

"even though it's" >> "once we support"

test/Transforms/LoopVectorize/pr31190.ll
33 ↗(On Diff #164667)

CHECK vectorized code emitted, or debug info stating it can be vectorized?

anna marked 14 inline comments as done.Sep 18 2018, 12:16 PM

Hi Ayal, thanks for your detailed review!

Best allow only a single store to an invariant address for now; until we're sure the last one to store is always identified correctly.

I've updated the patch to restrict to this case for now (diff coming up soon). Generally, if we have multiple stores to an invariant address, they might be canonicalized by InstCombine. So, this may not be as inhibiting as it sounds. Keeping this restriction and allowing "variant stores to invariant addresses" seems like a logical next step once this lands.

lib/Analysis/LoopAccessAnalysis.cpp
1869 ↗(On Diff #164667)

Nothing specific. This works as well. I've changed it. As a separate change, we'll need to improve isUniform because it currently considers uniform FP values to be non-uniform (since FP is not SCEVable).

test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll
131 ↗(On Diff #164667)

updated name.

test/Transforms/LoopVectorize/invariant-store-vectorization.ll
80 ↗(On Diff #164667)

since we don't have debug statements for what the cost model identifies this as, I've updated the above comment.

132 ↗(On Diff #164667)

good point - as stated in the comment earlier, I will restrict to one store to an invariant address for now.

219 ↗(On Diff #164667)

once we relax the check on whether the stored value is variant or invariant, it does not matter if we correctly identify it as variant or invariant. So, I think divergence analysis is not required.

anna updated this revision to Diff 166025.Sep 18 2018, 1:36 PM
anna marked 3 inline comments as done.

addressed review comments.

Ayal accepted this revision.Sep 24 2018, 2:52 PM

Thanks for taking care of everything; this LGTM now. I've added only a few minor optional comments.

lib/Analysis/LoopAccessAnalysis.cpp
1883 ↗(On Diff #166025)

Maybe clearer to do

if (isUniform(Ptr)) {
  // Consider multiple stores to the same uniform address as a store of a variant value.
  bool MultipleStoresToUniformPtr = UniformStores.insert(Ptr).second;
  HasVariantStoreToLoopInvariantAddress |= (!isUniform(ST->getValueOperand()) || MultipleStoresToUniformPtr);
}

Note that supporting a single store of a variant value to an invariant address is easier than supporting multiple (conditional) stores of invariant values to an invariant address, as discussed. So the two conditions should probably be separated when the patch taking care of the former is introduced.

lib/Transforms/Vectorize/LoopVectorize.cpp
1180 ↗(On Diff #166025)

The ": extract of last element" part is for future use, when stores of variant values to invariant addresses are supported, right? Best leave this part to that future patch, or add a TODO to test this extra cost then.

5422 ↗(On Diff #166025)

No need to add these enclosing curly brackets.

test/Transforms/LoopVectorize/invariant-store-vectorization.ll
219 ↗(On Diff #164667)

ok, works both ways - once we leverage divergence analysis we'll be able to handle such a store of uniform/invariant value, w/o needing relaxed support for stores of variant values.

This revision is now accepted and ready to land.Sep 24 2018, 2:52 PM
anna marked 3 inline comments as done.Sep 25 2018, 8:30 AM
anna added inline comments.
lib/Analysis/LoopAccessAnalysis.cpp
1883 ↗(On Diff #166025)

done. Should be bool MultipleStoresToUniformPtr = !UniformStores.insert(Ptr).second;

This revision was automatically updated to reflect the committed changes.