This is an archive of the discontinued LLVM Phabricator instance.

[LV][LAA] Vectorize loop invariant values stored into loop invariant address
ClosedPublic

Authored by anna on Aug 13 2018, 2:29 PM.

Details

Summary

We are overly conservative in the loop vectorizer with respect to stores to loop
invariant addresses.
More details in https://bugs.llvm.org/show_bug.cgi?id=38546
This is the first part of the fix, where we start by vectorizing loop invariant
values stored to loop invariant addresses.

Diff Detail

Event Timeline

anna created this revision.Aug 13 2018, 2:29 PM
Ayal added a comment.Aug 14 2018, 1:53 PM

The decision how to vectorize invariant stores also deserves attention: LoopVectorizationCostModel::setCostBasedWideningDecision() considers loads from uniform addresses, but not invariant stores - these may end up being scalarized or becoming a scatter; the former is preferred in this case, as the identical scalarized replicas can later be removed.

In any case associated cost estimates should be provided to support overall vectorization costs.

Note that vectorizing conditional invariant stores deserves special attention. Unconditional invariant stores are candidates to be sunk out of the loop, preferably before trying to vectorize it.

One approach to vectorize a conditional invariant store is to check if its mask is all false, and if not to perform a single invariant scalar store, for lack of a masked-scalar-store instruction. May be worth distinguishing between uniform and divergent conditions; this check is easier to carry out in the former case.

include/llvm/Analysis/LoopAccessAnalysis.h
570

This becomes dead?

578

AND both indicators?

anna added a comment.Aug 15 2018, 5:16 AM

Hi Ayal, thanks for the comments!

The decision how to vectorize invariant stores also deserves attention: LoopVectorizationCostModel::setCostBasedWideningDecision() considers loads from uniform addresses, but not invariant stores - these may end up being scalarized or becoming a scatter; the former is preferred in this case, as the identical scalarized replicas can later be removed.

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

In any case associated cost estimates should be provided to support overall vectorization costs.

agreed.

Note that vectorizing conditional invariant stores deserves special attention. Unconditional invariant stores are candidates to be sunk out of the loop, preferably before trying to vectorize it.

If we get unconditional invariant stores which haven't been sunk out of the loop and have reached the vectorizer, I think we should let the loop vectorizer vectorize them, irrespective of what other passes such as LICM should have done with store promotion/sinking. See the example in https://bugs.llvm.org/show_bug.cgi?id=38546#c1. Even running through clang++ -O3 doesn't sink the invariant store out of the loop, and that store prevents vectorization of the entire loop.

One approach to vectorize a conditional invariant store is to check if its mask is all false, and if not to perform a single invariant scalar store, for lack of a masked-scalar-store instruction. May be worth distinguishing between uniform and divergent conditions; this check is easier to carry out in the former case.

Thanks, I thought these were automatically handled. Will address in updated patch.

include/llvm/Analysis/LoopAccessAnalysis.h
570

The idea is to retain the identification of storeToLoopInvariantAddress if other passes which use LAA need it.
That's the reason I separated out the StoreToLoopInvariantAddress and NonVectorizableStoreToLoopInvariantAddress.

578

uh oh. was an older change. will fix.

anna updated this revision to Diff 160882.Aug 15 2018, 12:00 PM
anna marked 2 inline comments as done.

Added cost model changes for unpredicated invariant stores. Predicated invariant stores will
generate extra stores here, and the cost model already considers the cost of predicated stores.
Since the cost model correctly reflects the cost of the (badly) generated predicated stores,
I've added a couple of tests to show that invariant predicated stores are handled correctly, with TODOs
for a follow-on patch for better code gen.

anna added inline comments.Aug 15 2018, 12:02 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
5873

Predicated uniform stores will fall under this cost model. The next patch will be to address the improved code gen for this case and update the cost model for predicated uniform stores.

anna updated this revision to Diff 161525.Aug 20 2018, 11:56 AM

Teach LAA about non-predicated uniform stores. Added test cases to make sure
they are not treated as predicated stores.

Ayal added a comment.Aug 20 2018, 4:06 PM

...

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

  • - by revisiting LoopVectorizationCostModel::collectLoopUniforms()? ;-)
include/llvm/Analysis/LoopAccessAnalysis.h
570

OK. But LoopVectorizationLegality below seems to be its only user.

638

Better name it more accurately as, e.g., VariantStoreToLoopInvariantAddress?

lib/Analysis/LoopAccessAnalysis.cpp
1865

isLoopInvariantStoreValue ?

1871

Again, something LICM may have missed?

lib/Transforms/Vectorize/LoopVectorize.cpp
5754

Can use if (auto *LI = dyn_cast<LoadInst>(I)) {

5763

Indent

5880

On certain targets, e.g., skx, an invariant store may end up as a scatter, so setting this decision here to avoid that is important; potentially worthy of a note / a test.

anna marked 4 inline comments as done.Aug 21 2018, 10:03 AM
anna added inline comments.Aug 21 2018, 10:03 AM
include/llvm/Analysis/LoopAccessAnalysis.h
570

yes, that's right. I made the change, but the analysis has an ORE and there are 5 tests in the LoopAccessAnalysis that are failing because the ORE check "Store to invariant address was [not] found in loop" is missing. See test/Analysis/LoopAccessAnalysis/store-to-invariant-check1.ll where it looks for the presence of "Store to invariant address was found in loop".

I'll remove the code as a follow on clean up and if there's a need for this by other passes that use LAA, folks can add it back when required.

I think it also makes sense to add an ORE for the "VariantStoreToInvariantAddress" as part of this current change.

638

Okay, I'll change the name.

JFI- Today the changed name (VariantStoreToLoopInvariantAddress) is accurate. However, my plan is to eventually teach the vectorizer about all *safe* uniform stores, not just invariant values stored to invariant address.

So VariantStoreToLoopInvariantAddress can also be vectorized under certain conditions (safe dependence distance calculation for the store versus other memory access). So something like example below can be vectorized [1]:

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
      p[i] = b[j];
      z[j] += b[j];
    }

However, this cannot be vectorized safely:

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
      z[j] = (++p[i]);  <-- dependence distance for uniform store and load is 1.
    }

[1] LICM should try to sink the store out of the inner loop, but sometimes it cannot do so because it cannot prove dereferenceability for the store address or that the store is guaranteed to execute at least once.

lib/Analysis/LoopAccessAnalysis.cpp
1871

yes, LICM misses this as well - see added test case in inv_val_store_to_inv_address_conditional_inv.

anna updated this revision to Diff 161787.Aug 21 2018, 11:50 AM
anna marked an inline comment as done.

Addressed review comments, updated ORE message and tests, fixed an assertion failure in cost model calculation for uniform store (bug uncovered when running test
under X86 skylake)

anna marked an inline comment as done.Aug 21 2018, 11:53 AM
anna added inline comments.
include/llvm/Analysis/LoopAccessAnalysis.h
570

I've made both the changes in this patch since changing the ORE is clearer in one patch.

lib/Transforms/Vectorize/LoopVectorize.cpp
5880

thanks for bringing this up. It exercised X86TTIImpl::getMemoryOpCost, which exposed the bug in my previous diff in LoopVectorizationCostModel::getUniformMemOpCost for uniform stores: I was passing in the store's type instead of the stored value's type.
I've also updated it to use the "unified" interface for load/store just like the other cost model calculations - getGatherScatterCost etc.

anna marked an inline comment as done.Aug 21 2018, 11:56 AM

...

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

  • - by revisiting LoopVectorizationCostModel::collectLoopUniforms()? ;-)

Right now, I just run instcombine after loop vectorization to clean up those unnecessary stores (and test cases make sure there's only one store left). Looks like there are other places in LV which rely on InstCombine as the cleanup pass, so it may not be that bad after all? Thoughts?

hsaito added a subscriber: hsaito.Aug 21 2018, 1:35 PM

...

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

  • - by revisiting LoopVectorizationCostModel::collectLoopUniforms()? ;-)

Right now, I just run instcombine after loop vectorization to clean up those unnecessary stores (and test cases make sure there's only one store left). Looks like there are other places in LV which rely on InstCombine as the cleanup pass, so it may not be that bad after all? Thoughts?

Ideally, each optimizer should generate as clean output IR as it feasibly can. Cleaning up this particular "mess" is one of the simpler tasks LV can do on its own.

Ayal added a comment.Aug 22 2018, 2:24 PM

...

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

  • - by revisiting LoopVectorizationCostModel::collectLoopUniforms()? ;-)

Right now, I just run instcombine after loop vectorization to clean up those unnecessary stores (and test cases make sure there's only one store left). Looks like there are other places in LV which rely on InstCombine as the cleanup pass, so it may not be that bad after all? Thoughts?

Yeah, this is a bit embarrassing, but currently invariant loads also get replicated (and cleaned up later), despite trying to avoid doing so by recording IsUniform in VPReplicateRecipe. In general, if it's simpler and more consistent to generate code in a common template and potentially cleanup later, should be ok provided the cost model accounts for it accurately and cleanup is guaranteed, as checked by tests. BTW, LV already has an internal cse(). But in this case, VPlan should reflect the final outcome better, i.e., with a correct IsUniform. This should be taken care of, possibly by a separate patch.

include/llvm/Analysis/LoopAccessAnalysis.h
572

Update above comment as well: "non-vectorizable stores" >> "of variant values"

638

Yes, in/variant stores to an invariant address may carry cross-iteration dependencies with other loads/stores, which could potentially be checked at runtime similar to 'regular' stores. LV supports reductions/inductions if carried by temporaries only, rather than via memory. Such cases should indeed be LICM'd before vectorization - sinking unconditional stores down to a dominated "middle" block, where it's dereferenceable and known to have executed at least once.

lib/Analysis/LoopAccessAnalysis.cpp
1871

Ahh, but a phi of invariant values is invariant iff the compares that decide which predecessor will reach the phi are also invariant. In inv_val_store_to_inv_address_conditional_inv this holds because there %cmp determines which predecessor it'll be, and %cmp is invariant. In general, Divergence Analysis is designed to provide this answer, as in D50433's isUniform().
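
For illustration, a minimal hypothetical loop of this shape (the names and structure below are assumed for illustration, not copied from the actual test): %cmp depends only on the loop-invariant %k and %ntrunc, so the phi %storeval is an invariant value even though it is defined inside the loop.

define void @inv_cond_phi_sketch(i32* %a, i64 %n, i32 %k, i32 %ntrunc) {
entry:
  %cmp = icmp eq i32 %k, %ntrunc              ; loop-invariant condition
  br label %for.body

for.body:
  %i = phi i64 [ 0, %entry ], [ %i.next, %latch ]
  br i1 %cmp, label %cond_store, label %cond_store_k

cond_store:
  br label %latch

cond_store_k:
  br label %latch

latch:
  ; both incoming values and the controlling compare are invariant,
  ; so the stored value is invariant
  %storeval = phi i32 [ %ntrunc, %cond_store ], [ %k, %cond_store_k ]
  store i32 %storeval, i32* %a
  %i.next = add nuw nsw i64 %i, 1
  %exitcond = icmp eq i64 %i.next, %n
  br i1 %exitcond, label %exit, label %for.body

exit:
  ret void
}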

lib/Transforms/Vectorize/LoopVectorize.cpp
5762

Should be consistent and use the same isLoopInvariantStoreValue() noted above.

5876

We expect here that isa<LoadInst>(&I) || isa<StoreInst>(&I) (as memoryInstructionCanBeWidened() will assert below) having checked getLoadStorePointerOperand(&I) above.

5880

very good

anna marked 3 inline comments as done.Aug 23 2018, 8:36 AM
anna added inline comments.
lib/Analysis/LoopAccessAnalysis.cpp
1871

yes, that's right. Note that this patch handles phis of invariant values based on either an invariant condition or a variant condition (see inv_val_store_to_inv_address_conditional_diff_values_ic where the phi result is based on a varying condition).

The improved codegen and cost model handling is for predicated stores, where the block containing the invariant store is to be predicated. Today, we just handle this as a "predicated store" cost and generate code accordingly.

anna added a comment.Aug 23 2018, 8:37 AM

...

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

  • - by revisiting LoopVectorizationCostModel::collectLoopUniforms()? ;-)

Right now, I just run instcombine after loop vectorization to clean up those unnecessary stores (and test cases make sure there's only one store left). Looks like there are other places in LV which rely on InstCombine as the cleanup pass, so it may not be that bad after all? Thoughts?

Yeah, this is a bit embarrassing, but currently invariant loads also get replicated (and cleaned up later), despite trying to avoid doing so by recording IsUniform in VPReplicateRecipe. In general, if it's simpler and more consistent to generate code in a common template and potentially cleanup later, should be ok provided the cost model accounts for it accurately and cleanup is guaranteed, as checked by tests. BTW, LV already has an internal cse(). But in this case, VPlan should reflect the final outcome better, i.e., with a correct IsUniform. This should be taken care of, possibly by a separate patch.

I see. thanks for the clarification. So, for now, I'll leave the stores in the IR just like we're doing for the loads and add a "TODO" for both.

anna updated this revision to Diff 162206.Aug 23 2018, 9:39 AM

address review comments (NFC wrt previous diff). Added one test for varying value stored into invariant address.

anna updated this revision to Diff 162212.Aug 23 2018, 9:59 AM

Added TODOs for better code gen of predicated uniform store and removing redundant loads and stores left behind during
scalarization of these uniform loads and stores.

Ayal added inline comments.Aug 23 2018, 12:08 PM
lib/Analysis/LoopAccessAnalysis.cpp
1871

So the suggested isLoopInvariantStoreValue name is incorrect, as the store value may be variant. What's special about these variant values - why not handle any store value?

Yes, conditional stores to an invariant address will end up scalarized and predicated, i.e., with a branch-and-store per lane, which is quite inefficient. A masked scatter may work better there, until optimized by a single branch-and-store if any lane is masked-on (invariant stored value) or single branch-and-store of last masked-on lane (in/variant stored value).
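
For reference, one possible shape of the "single branch-and-store of the last masked-on lane" lowering (a sketch only, not what the current patch emits; %mask, %vec.val and %a are assumed names, and the lane-to-bit mapping of the bitcast is assumed):

  ; skip the store entirely if no lane is masked-on; otherwise store the value
  ; of the last (highest-indexed) masked-on lane, matching scatter semantics
  %m   = bitcast <16 x i1> %mask to i16
  %any = icmp ne i16 %m, 0
  br i1 %any, label %do.store, label %cont

do.store:
  %lz  = call i16 @llvm.ctlz.i16(i16 %m, i1 true)   ; leading zeros of the mask
  %idx = sub i16 15, %lz                            ; index of the last masked-on lane
  %val = extractelement <16 x i32> %vec.val, i16 %idx
  store i32 %val, i32* %a
  br label %cont

cont: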

lib/Transforms/Vectorize/LoopVectorize.cpp
5762

This is still inconsistent with the OR-operands-are-invariant above.

anna added inline comments.Aug 23 2018, 12:32 PM
lib/Analysis/LoopAccessAnalysis.cpp
1871

I am currently adding support for *any variant* store to an invariant address. The unsafe cross-iteration dependencies are identified through LAA ("unsafe dependent memory operations in loop"). This was identified without any changes required on my side to the LAA memory conflict detection. However, I'm not sure if LAA handles all cases exhaustively.

The reason I started with this sub-patch is that when the stored value is not a varying memory access from within the loop (that's what this blob of code is really trying to do - see isLoopInvariant and hasLoopInvariantOperands), we don't need to reason about whether LAA handles all memory conflict detection.

When the stored value can be any variant value, we need to make sure that the LAA pass handles all the memory conflicts correctly, or update LAA if that isn't the case.

lib/Transforms/Vectorize/LoopVectorize.cpp
5762

will update.

anna added inline comments.Aug 23 2018, 1:27 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
5762

actually, this is correct. We don't need to update it to the "incorrectly named" lambda above.

We need to do an extract if the value is not invariant: example case:

for.body:                                         ; preds = %latch, %entry
  %i = phi i64 [ %i.next, %latch ], [ 0, %entry ]
  %tmp1 = getelementptr inbounds i32, i32* %b, i64 %i
  %tmp2 = load i32, i32* %tmp1, align 8
  %varying_cmp = icmp eq i32 %tmp2, %k
  store i32 %ntrunc, i32* %tmp1
  br i1 %varying_cmp, label %cond_store, label %cond_store_k

cond_store:
  br label %latch

cond_store_k:
  br label %latch

latch:
  %storeval = phi i32 [ %ntrunc, %cond_store ], [ %k, %cond_store_k ]
  store i32 %storeval, i32* %a <-- uniform store

storeval's operands are invariant, but the value being chosen in each iteration of the loop varies based on %varying_cmp. In this case, we need an extract and then the scalar store. That's exactly what we do as well.

anna added a comment.Aug 24 2018, 7:00 AM

okay, to keep this patch true to the original intent and commit message: I'm going to change it to handle just the store of invariant values to invariant addresses (i.e. no support for OR-operands-are-invariant). It will admittedly be a more conservative patch. The ORE message will also correctly reflect "variant stores to invariant addresses".

Also, the more general patch which is in progress is to handle the store of any (in/variant) value into an invariant address. It requires handling a UB case: when the user incorrectly annotates a loop which has memory conflicts as parallel.loop, and the vectorizer vectorizes the loop with a store to a uniform address (but the loop has a memory conflict). There was a bug fixed a while back: https://bugs.llvm.org/show_bug.cgi?id=15794#c4
As part of this more general patch, the ORE about uniform stores will also be removed, since we don't seem to need it (or we can keep the original code around if needed).

anna added a comment.Aug 24 2018, 10:23 AM

One more interesting thing I noticed while adding predicated invariant stores to X86 (for -mcpu=skylake-avx512): it supports masked scatter for non-uniform stores.
But we need to add support for uniform stores along with this patch. Today, it just generates incorrect code (no predication whatsoever).
For other architectures that do not have these masked intrinsics, we just generate the predicated store by doing an extract and branch on each lane (correct but inefficient and will be avoided unless -force-vector-width=X).

anna planned changes to this revision.Aug 24 2018, 10:26 AM

see comment above for masked scatter support.

One more interesting thing I noticed while adding predicated invariant stores to X86 (for -mcpu=skylake-avx512): it supports masked scatter for non-uniform stores.
But we need to add support for uniform stores along with this patch. Today, it just generates incorrect code (no predication whatsoever).
For other architectures that do not have these masked intrinsics, we just generate the predicated store by doing an extract and branch on each lane (correct but inefficient and will be avoided unless -force-vector-width=X).

In general, self output dependence is fine to vectorize (whether the store address is uniform or random), as long as (masked) scatter (or scatter emulation) happens from lower elements to higher elements. Intel's scatter instruction is implemented in that way, and so is CG Prepare's serialization of masked scatter intrinsic. When we check for TTI based availability/cost, we need to ensure that the HW scatter support satisfies this ordering requirement since some scatter implementations may not.

anna added a comment.Aug 24 2018, 12:47 PM

One more interesting thing I noticed while adding predicated invariant stores to X86 (for -mcpu=skylake-avx512): it supports masked scatter for non-uniform stores.
But we need to add support for uniform stores along with this patch. Today, it just generates incorrect code (no predication whatsoever).
For other architectures that do not have these masked intrinsics, we just generate the predicated store by doing an extract and branch on each lane (correct but inefficient and will be avoided unless -force-vector-width=X).

In general, self output dependence is fine to vectorize (whether the store address is uniform or random), as long as (masked) scatter (or scatter emulation) happens from lower elements to higher elements.

I don't think the above comment matters for uniform addresses because a uniform address is invariant. This is what the langref states for scatter intrinsic (https://llvm.org/docs/LangRef.html#id1792):

The data stored in memory is a vector of any integer, floating-point or pointer data type. Each vector element is stored in an arbitrary memory address. Scatter with overlapping addresses is guaranteed to be ordered from least-significant to most-significant element.

The scatter address is not overlapping for the uniform address. It is the exact same address. This is the code that gets generated for uniform stores on skylake with AVX-512 support once I fixed the bug in this patch (the scatter location is the same address and the stored value is also the same, and the mask is the vector of booleans):
pseudo code:

if (b[i] ==k)
  a = ntrunc; <-- uniform store based on condition above.

IR generated:

vector.ph:
  %broadcast.splatinsert5 = insertelement <16 x i32> undef, i32 %k, i32 0
  %broadcast.splat6 = shufflevector <16 x i32> %broadcast.splatinsert5, <16 x i32> undef, <16 x i32> zeroinitializer <-- vector splat of k
  %broadcast.splatinsert9 = insertelement <16 x i32*> undef, i32* %a, i32 0
  %broadcast.splat10 = shufflevector <16 x i32*> %broadcast.splatinsert9, <16 x i32*> undef, <16 x i32> zeroinitializer <-- vector splat of i32* a.

vector.body:
  %2 = getelementptr inbounds i32, i32* %b, i64 %index
  %3 = bitcast i32* %2 to <16 x i32>*
  %wide.load = load <16 x i32>, <16 x i32>* %3, align 8
  %4 = icmp eq <16 x i32> %wide.load, %broadcast.splat6
  call void @llvm.masked.scatter.v16i32.v16p0i32(<16 x i32> %broadcast.splat8, <16 x i32*> %broadcast.splat10, i32 4, <16 x i1> %4) <-- scatter storing the same element into the same address (a), depending on the same condition b[i] == k

For other architectures that do not have these masked intrinsics, we just generate the predicated store by doing an extract and branch on each lane (correct but inefficient and will be avoided unless -force-vector-width=X).

In general, self output dependence is fine to vectorize (whether the store address is uniform or random), as long as (masked) scatter (or scatter emulation) happens from lower elements to higher elements.

I don't think the above comment matters for uniform addresses because a uniform address is invariant.

Only if you are storing a uniform value.

This is what the langref states for scatter intrinsic (https://llvm.org/docs/LangRef.html#id1792):

The data stored in memory is a vector of any integer, floating-point or pointer data type. Each vector element is stored in an arbitrary memory address. Scatter with overlapping addresses is guaranteed to be ordered from least-significant to most-significant element.

Thanks for reminding me that the intrinsic is defined with the ordering requirement.

We should also consider doing this, depending on the cost of branch versus masked scatter. For the targets w/o masked scatter, this should be better than masked scatter emulation.

%5 = bitcast <16 x i1> %4 to i16
%6 = icmp eq i16 %5, 0
br i1 %6, label %skip, label %fall
fall:
  store i32 %ntrunc, i32* %a
  br label %skip
skip:

anna added a comment.Aug 24 2018, 1:45 PM

We should also consider doing this, depending on the cost of branch versus masked scatter. For the targets w/o masked scatter, this should be better than masked scatter emulation.

%5 = bitcast <16 x i1> %4 to i16
%6 = icmp eq i16 %5, 0
br i1 %6, label %skip, label %fall
fall:
  store i32 %ntrunc, i32* %a
  br label %skip
skip:

Yes, that is the improved codegen stated as a TODO in the cost model. Today both the cost model and the code gen will identify it as a normal predicated store: a series of branches and stores. Also, we need to differentiate these 2 cases:

if (b[i] == k)
  a = ntrunc;

versus

if (b[i] == k)
  a = ntrunc;
else
  a = m;

The second example should be converted into a vector-select based on b[i] == k and the last element will be extracted out of the vector select and stored into a.
However, if for some reason it is not converted into a select and just left as 2 predicated stores, it is incorrect to use the same code transformation as for the first example. For the first example, we check if all values in the condition are false, and if so we skip the store. In the second case, we always need to store a value, but that value is just decided by the last element of the condition. Just 2 different forms of predicated stores; see the sketch below.
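
A sketch of that select-based form for the second example (VF=16; %wide.load, %broadcast.k, %broadcast.ntrunc, %broadcast.m and %a are assumed names, not output of this patch):

  ; blend the two invariant values per lane on b[i] == k, then store only the
  ; last lane's value - the value the final scalar iteration would have stored
  %vec.cmp = icmp eq <16 x i32> %wide.load, %broadcast.k
  %blend   = select <16 x i1> %vec.cmp, <16 x i32> %broadcast.ntrunc, <16 x i32> %broadcast.m
  %last    = extractelement <16 x i32> %blend, i32 15
  store i32 %last, i32* %a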

Yes, that is the improved codegen stated as a TODO in the cost model.

Aha. OK. Thanks for the clarification.

Ayal added a comment.Aug 26 2018, 9:29 AM

This is what the langref states for scatter intrinsic (https://llvm.org/docs/LangRef.html#id1792):

The data stored in memory is a vector of any integer, floating-point or pointer data type. Each vector element is stored in an arbitrary memory address. Scatter with overlapping addresses is guaranteed to be ordered from least-significant to most-significant element.

Yes, this was intentional, precisely to support vectorization of (possibly) self-overwriting stores.

...
This is the code that gets generated for uniform stores on skylake with AVX-512 support once I fixed the bug in this patch
...

LGTM.
Indeed, care must be taken to avoid using more than one masked scatter to the same invariant address; but LAA should flag such non-self cross-iteration dependencies.

include/llvm/Analysis/LoopAccessAnalysis.h
574

Could rename VariantStoreToLoopInvariantAddress to HasVariantStoreToLoopInvariantAddress.

lib/Analysis/LoopAccessAnalysis.cpp
1871

It is conceivable that stores of invariant values to invariant addresses can participate in a subset of unsafe scenarios, which may be easier for LAA to detect, and thus start by treating only stores of invariant values to invariant addresses. But storing a variant phi whose "dominating" compares are not all invariant, could conceptually produce arbitrary variant values and dependencies, despite having invariant values for all other operands of the phi; e.g., 0 and -1. Presumably, this case does not differ, from LAA perspective, from stores of any variant value to invariant address.

lib/Transforms/Vectorize/LoopVectorize.cpp
5762

Agreed. Misled by the erroneous isLoopInvariantStoreValue() name.

anna updated this revision to Diff 163328.Aug 30 2018, 7:51 AM

added test for conditional uniform store for AVX512. Rebased over fix in D51313.

anna added a comment.Aug 30 2018, 7:52 AM

This patch now only vectorizes invariant values stored into invariant addresses. It also correctly handles conditionally executed stores (fixed bug for scatter code generation in AVX512).

anna updated this revision to Diff 164667.Sep 10 2018, 7:17 AM

rebased over D51313.

Ayal added a comment.Sep 12 2018, 5:25 AM

Best allow only a single store to an invariant address for now; until we're sure the last one to store is always identified correctly.

include/llvm/Analysis/LoopAccessAnalysis.h
572
/// Checks existence of stores to invariant address inside loop.
/// If such stores exist, checks if those are stores of variant values.

can be updated and simplified into something like

/// If the loop has any store of a variant value to an invariant address, then return true, else return false.
lib/Analysis/LoopAccessAnalysis.cpp
1880

How about isUniform(Ptr) && !isUniform(ST->getValueOperand()) ?
Relying more consistently on SCEV to determine invariance of both address and stored value. Is there a reason for treating stored value more conservatively, checking its invariance by asking if it's outside the loop?

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
760

update the message as well: "write of variant value to a loop invariant address ..."

lib/Transforms/Vectorize/LoopVectorize.cpp
5762

Complementing the consistent use of isUniform rather than isLoopInvariant:
bool isLoopInvariantStoreValue = Legal->isUniform(SI->getValueOperand()); ?
, similar to the way the address is checked to be uniform before calling this method below.

5876

Comment can be simplified to something like

// TODO: Avoid replicating loads and stores instead of
// relying on instcombine to remove them.
test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll
16 ↗(On Diff #164667)

Have one space instead of two between i32 and %ntrunc on the check-not'd store. Easier to see that this checks for a single copy of the store, i.e., that instcombine eliminated all redundant copies. May want to comment what this test is designed to check.

131 ↗(On Diff #164667)

inv_val_load_to?

test/Transforms/LoopVectorize/invariant-store-vectorization.ll
12

"that check whether" >> "check that"
("whether" usually comes with an "or not")

81

"as identifying these" >> "identify them"
Do we check what the cost model identifies?

133

Hmm, multiple stores to the same invariant address did not trigger LAI memory dependence checks(?)
This may generate wrong code if the conditional scalarized stores are emitted in the wrong order, or if a pair of masked scatters are used.

152

good to continue CHECKing that EE1 is used in a branch that guards a store of %ntrunc to %a.

183

Now the store/s is/are no longer of invariant value/s.

184

.. once we support vectorizing stores of variant values to invariant addresses

220

.. efficiently once divergence analysis identifies storeval as uniform

253

"even though it's" >> "once we support"

test/Transforms/LoopVectorize/pr31190.ll
33

CHECK vectorized code emitted, or debug info stating it can be vectorized?

anna marked 14 inline comments as done.Sep 18 2018, 12:16 PM

Hi Ayal, thanks for your detailed review!

Best allow only a single store to an invariant address for now; until we're sure the last one to store is always identified correctly.

I've updated the patch to restrict to this case for now (diff coming up soon). Generally, if we have multiple stores to an invariant address, it might be canonicalized by InstCombine. So, this may not be as inhibiting as it sounds. Keeping this restriction and allowing "variant stores to invariant addresses" seems like a logical next step once this lands.

lib/Analysis/LoopAccessAnalysis.cpp
1880

Nothing specific. This works as well. I've changed it. As a separate change, we'll need to improve isUniform because it considers uniform FP values as non-uniform (since FP is non-SCEVable).

test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll
131 ↗(On Diff #164667)

updated name.

test/Transforms/LoopVectorize/invariant-store-vectorization.ll
81

Since we don't have debug statements for what the cost model identifies this as, I've updated the above comment.

133

good point - as stated in the earlier comment, I will restrict to one store to an invariant address for now.

220

once we relax the check of variant/invariant value being stored, it does not matter whether we correctly identify it as variant or invariant. So, I think divergence analysis is not required.

anna updated this revision to Diff 166025.Sep 18 2018, 1:36 PM
anna marked 3 inline comments as done.

addressed review comments.

Ayal accepted this revision.Sep 24 2018, 2:52 PM

Thanks for taking care of everything, this LGTM now, added only a few minor optional comments.

lib/Analysis/LoopAccessAnalysis.cpp
1890

Maybe clearer to do

if (isUniform(Ptr)) {
  // Consider multiple stores to the same uniform address as a store of a variant value.
  bool MultipleStoresToUniformPtr = UniformStores.insert(Ptr).second;
  HasVariantStoreToLoopInvariantAddress |= (!isUniform(ST->getValueOperand()) || MultipleStoresToUniformPtr);
}

Note that supporting a single store of a variant value to an invariant address is easier than supporting multiple (conditional) stores of invariant values to an invariant address, as discussed. So the two conditions should probably be separated when the patch taking care of the former is introduced.

lib/Transforms/Vectorize/LoopVectorize.cpp
1488

The ": extract of last element" part is for future use, when stores of variant values to invariant addresses are supported, right? Best leave this part to that future patch, or add a TODO to test this extra cost then.

5875

No need to add these enclosing curly brackets.

test/Transforms/LoopVectorize/invariant-store-vectorization.ll
220

ok, works both ways - once we leverage divergence analysis we'll be able to handle such a store of uniform/invariant value, w/o needing relaxed support for stores of variant values.

This revision is now accepted and ready to land.Sep 24 2018, 2:52 PM
anna marked 3 inline comments as done.Sep 25 2018, 8:30 AM
anna added inline comments.
lib/Analysis/LoopAccessAnalysis.cpp
1890

done. Should be bool MultipleStoresToUniformPtr = !UniformStores.insert(Ptr).second;

This revision was automatically updated to reflect the committed changes.