This is an archive of the discontinued LLVM Phabricator instance.

Paths

llvm/trunk/
- include/llvm/Analysis/LoopAccessAnalysis.h
- lib/Analysis/LoopAccessAnalysis.cpp
- lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
- lib/Transforms/Vectorize/LoopVectorize.cpp
- test/Analysis/LoopAccessAnalysis/memcheck-wrapping-pointers.ll
- test/Analysis/LoopAccessAnalysis/store-to-invariant-check1.ll
- test/Analysis/LoopAccessAnalysis/store-to-invariant-check2.ll
- test/Analysis/LoopAccessAnalysis/store-to-invariant-check3.ll
- test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll
- test/Transforms/LoopVectorize/invariant-store-vectorization.ll
- test/Transforms/LoopVectorize/pr31190.ll

Differential D50665

[LV][LAA] Vectorize loop invariant values stored into loop invariant address
ClosedPublic

Authored by anna on Aug 13 2018, 2:29 PM.

Details

Reviewers

anemet
Ayal
mkuper
mssimpso

Commits

rGb1e3d4531826: [LV][LAA] Vectorize loop invariant values stored into loop invariant address
rL343028: [LV][LAA] Vectorize loop invariant values stored into loop invariant address

Summary

We are overly conservative in loop vectorizer with respect to stores to loop
invariant addresses.
More details in https://bugs.llvm.org/show_bug.cgi?id=38546
This is the first part of the fix where we start with vectorizing loop invariant
values to loop invariant addresses.

Diff Detail

Repository: rL LLVM

Event Timeline

anna created this revision. Aug 13 2018, 2:29 PM

Harbormaster completed remote builds in B21411: Diff 160448. Aug 13 2018, 2:29 PM

The decision how to vectorize invariant stores also deserves attention: LoopVectorizationCostModel::setCostBasedWideningDecision() considers loads from uniform addresses, but not invariant stores - these may end up being scalarized or becoming a scatter; the former is preferred in this case, as the identical scalarized replicas can later be removed. In any case associated cost estimates should be provided to support overall vectorization costs. Note that vectorizing conditional invariant stores deserves special attention. Unconditional invariant stores are candidates to be sunk out of the loop, preferably before trying to vectorize it. One approach to vectorize a conditional invariant store is to check if its mask is all false, and if not to perform a single invariant scalar store, for lack of a masked-scalar-store instruction. May be worth distinguishing between uniform and divergent conditions; this check is easier to carry out in the former case.

include/llvm/Analysis/LoopAccessAnalysis.h
570 ↗	(On Diff #160448)	This becomes dead?
578 ↗	(On Diff #160448)	AND both indicators?
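The all-false mask check proposed above for conditional invariant stores can be sketched in plain C++. This is only a model of the idea, not LoopVectorize code: `VF`, `anyLaneActive` and `storeInvariantIfAnyLaneActive` are hypothetical names, and the mask is modelled as an array of booleans rather than an IR vector.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical vector width, chosen only for illustration.
constexpr std::size_t VF = 4;

// Returns true if at least one lane of the mask is enabled.
bool anyLaneActive(const std::array<bool, VF> &mask) {
  for (bool lane : mask)
    if (lane)
      return true;
  return false;
}

// Models one vector iteration of
//   if (cond[i]) *addr = value;
// where both addr and value are loop invariant.
void storeInvariantIfAnyLaneActive(const std::array<bool, VF> &mask,
                                   int value, int *addr) {
  assert(addr != nullptr && "address must be valid");
  if (anyLaneActive(mask))
    *addr = value; // one scalar store replaces up to VF predicated stores
}
```

Because every lane would store the same value to the same address, a single guarded scalar store is enough; no per-lane branch or masked scatter is needed in this model.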

In D50665#1199777, @Ayal wrote:

Hi Ayal, thanks for the comments!

The decision how to vectorize invariant stores also deserves attention: LoopVectorizationCostModel::setCostBasedWideningDecision() considers loads from uniform addresses, but not invariant stores - these may end up being scalarized or becoming a scatter; the former is preferred in this case, as the identical scalarized replicas can later be removed.

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

In any case associated cost estimates should be provided to support overall vectorization costs.

agreed.

Note that vectorizing conditional invariant stores deserves special attention. Unconditional invariant stores are candidates to be sunk out of the loop, preferably before trying to vectorize it.

If unconditional invariant stores have not been sunk out of the loop by the time it reaches the vectorizer, I think we should let the loop vectorizer vectorize it, irrespective of what other passes such as LICM should have done with store promotion/sinking. See the example in https://bugs.llvm.org/show_bug.cgi?id=38546#c1. Even running through clang++ O3 doesn't sink the invariant store out of the loop, and that store prevents vectorization of the entire loop.

One approach to vectorize a conditional invariant store is to check if its mask is all false, and if not to perform a single invariant scalar store, for lack of a masked-scalar-store instruction. May be worth distinguishing between uniform and divergent conditions; this check is easier to carry out in the former case.

Thanks, I thought these were automatically handled. Will address in updated patch.

include/llvm/Analysis/LoopAccessAnalysis.h
570 ↗	(On Diff #160448)	The idea is to retain the identification of `storeToLoopInvariantAddress` if other passes which use LAA need it. That's the reason I separated out the `StoreToLoopInvariantAddress` and `NonVectorizableStoreToLoopInvariantAddress`.
578 ↗	(On Diff #160448)	uh oh. was an older change. will fix.

Added cost model changes for unpredicated invariant stores. The predicated invariant stores will
generate extra stores here, and the cost model (already) considers the cost of predicated stores.
Since the cost model correctly reflects the cost of the (badly) generated predicated stores,
I've added a couple of tests to show that invariant predicated stores are handled correctly, with TODOs
for a follow-on patch for better code gen.

Harbormaster completed remote builds in B21535: Diff 160882. Aug 15 2018, 12:00 PM

anna added inline comments. Aug 15 2018, 12:02 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
5869 ↗	(On Diff #160882)	Predicated uniform stores will fall under this cost model. The next patch will be to address the improved code gen for this case and update the cost model for predicated uniform stores.

ping

Herald added a subscriber: rkruppe. Aug 20 2018, 8:53 AM

Teach LAA about non-predicated uniform stores. Added test cases
to make sure they are not treated as predicated stores.

Harbormaster completed remote builds in B21679: Diff 161525. Aug 20 2018, 11:57 AM

In D50665#1200509, @anna wrote:

...

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

- by revisiting LoopVectorizationCostModel::collectLoopUniforms()? ;-)

include/llvm/Analysis/LoopAccessAnalysis.h
638 ↗	(On Diff #161525)	Better name it more accurately as, e.g., `VariantStoreToLoopInvariantAddress`?
570 ↗	(On Diff #160448)	OK. But LoopVectorizationLegality below seems to be its only user.
lib/Analysis/LoopAccessAnalysis.cpp
1865 ↗	(On Diff #161525)	`isLoopInvariantStoreValue` ?
1871 ↗	(On Diff #161525)	Again, something LICM may have missed?
lib/Transforms/Vectorize/LoopVectorize.cpp
5754 ↗	(On Diff #161525)	Can use `if (auto *LI = dyn_cast<LoadInst>(I)) {`
5763 ↗	(On Diff #161525)	Indent
5880 ↗	(On Diff #161525)	On certain targets, e.g., skx, an invariant store may end up as a scatter, so setting this decision here to avoid that is important; potentially worthy of a note / a test.

anna marked 4 inline comments as done. Aug 21 2018, 10:03 AM

anna added inline comments. Aug 21 2018, 10:03 AM

include/llvm/Analysis/LoopAccessAnalysis.h
570 ↗	(On Diff #160448)	yes, that's right. I made the change, but the analysis has an ORE and there are 5 tests in the LoopAccessAnalysis that are failing because the ORE check "Store to invariant address was [not] found in loop" is missing. See test/Analysis/LoopAccessAnalysis/store-to-invariant-check1.ll where it looks for the presence of "Store to invariant address was found in loop". I'll remove the code as a follow on clean up and if there's a need for this by other passes that use LAA, folks can add it back when required. I think it also makes sense to add an ORE for the "VariantStoreToInvariantAddress" as part of this current change.
638 ↗	(On Diff #161525)	Okay, I'll change the name. JFYI: today the changed name (`VariantStoreToLoopInvariantAddress`) is accurate. However, my plan is to eventually teach the vectorizer about all safe uniform stores, not just invariant values stored to an invariant address. So a `VariantStoreToLoopInvariantAddress` can also be vectorized under certain conditions (safe dependence distance calculation for the store versus other memory accesses). Something like the example below can be vectorized [1]:

for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    p[i] = b[j];
    z[j] += b[j];
  }
}

However, this cannot be vectorized safely:

for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    z[j] = (++p[i]); // dependence distance for uniform store and load is 1
  }
}

[1] LICM should try to sink the store out of the inner loop, but sometimes it cannot do so because it cannot prove dereferenceability for the store address or that the store is guaranteed to execute at least once.
lib/Analysis/LoopAccessAnalysis.cpp
1871 ↗	(On Diff #161525)	yes, LICM misses this as well - see added test case in `inv_val_store_to_inv_address_conditional_inv`.
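The difference between the two loops in the comment on line 638 above can be illustrated with a small self-contained C++ model. It is a sketch under stated assumptions: only the inner loop is modelled (so `p[i]` becomes a single `int p`), `VF` is a hypothetical vector width, and "vectorized" simply means that each group of `VF` iterations performs all of its loads before its stores. None of the function names come from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t VF = 4; // hypothetical vector width

// First example, scalar: p = b[j]; z[j] += b[j];
int scalarCopy(std::vector<int> &z, const std::vector<int> &b, int p) {
  assert(z.size() == b.size());
  for (std::size_t j = 0; j < z.size(); ++j) {
    p = b[j];
    z[j] += b[j];
  }
  return p;
}

// First example, VF lanes at a time: wide load, wide add/store, then one
// store of the last lane to the uniform address.
int vectorCopy(std::vector<int> &z, const std::vector<int> &b, int p) {
  assert(z.size() == b.size() && z.size() % VF == 0);
  for (std::size_t j = 0; j < z.size(); j += VF) {
    int lanes[VF];
    for (std::size_t l = 0; l < VF; ++l)
      lanes[l] = b[j + l];
    for (std::size_t l = 0; l < VF; ++l)
      z[j + l] += lanes[l];
    p = lanes[VF - 1]; // the last lane's value is what the scalar loop leaves
  }
  return p;
}

// Second example, scalar: z[j] = ++p;
int scalarIncrement(std::vector<int> &z, int p) {
  for (std::size_t j = 0; j < z.size(); ++j)
    z[j] = ++p;
  return p;
}

// Second example, naively widened: the uniform address is loaded once per
// vector iteration, so the increments of the other lanes are lost.
int naiveVectorIncrement(std::vector<int> &z, int p) {
  assert(z.size() % VF == 0);
  for (std::size_t j = 0; j < z.size(); j += VF) {
    int loaded = p;
    for (std::size_t l = 0; l < VF; ++l)
      z[j + l] = loaded + 1;
    p = loaded + 1;
  }
  return p;
}
```

In this model the first loop gives identical results in both forms, while the second does not, because each iteration reads the value written by the previous one.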

Addressed review comments, updated ORE message and tests, fixed an assertion failure in cost model calculation for uniform store (bug uncovered when running test
under X86 skylake)

Harbormaster completed remote builds in B21741: Diff 161787. Aug 21 2018, 11:50 AM

anna marked an inline comment as done. Aug 21 2018, 11:53 AM

anna added inline comments.

include/llvm/Analysis/LoopAccessAnalysis.h
570 ↗	(On Diff #160448)	I've made both the changes in this patch since changing the ORE is clearer in one patch.
lib/Transforms/Vectorize/LoopVectorize.cpp
5880 ↗	(On Diff #161525)	thanks for bringing this up. It exercised the `X86TTIImpl::getMemoryOpCost` which showed the bug in my previous diff for `LoopVectorizationCostModel::getUniformMemOpCost` for uniform store. I was passing in the store's type instead of the store val type. I've also updated it to use the "unified" interface for load/store just like the other cost model calculations - `getGatherScatterCost` etc.

In D50665#1206780, @Ayal wrote:

In D50665#1200509, @anna wrote:

...

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

- by revisiting LoopVectorizationCostModel::collectLoopUniforms()? ;-)

Right now, I just run instcombine after loop vectorization to clean up those unnecessary stores (and test cases make sure there's only one store left). Looks like there are other places in LV which relies on InstCombine as the clean up pass, so it may not be that bad after all? Thoughts?

In D50665#1208026, @anna wrote:

In D50665#1206780, @Ayal wrote:

In D50665#1200509, @anna wrote:

...

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

- by revisiting LoopVectorizationCostModel::collectLoopUniforms()? ;-)

Right now, I just run instcombine after loop vectorization to clean up those unnecessary stores (and test cases make sure there's only one store left). Looks like there are other places in LV which relies on InstCombine as the clean up pass, so it may not be that bad after all? Thoughts?

Ideally, each optimizer should generate as clean output IR as it can feasibly do so. Cleaning up this particular "mess" is one of the simpler tasks LV can do on its own.

In D50665#1208026, @anna wrote:

In D50665#1206780, @Ayal wrote:

In D50665#1200509, @anna wrote:

...

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

- by revisiting LoopVectorizationCostModel::collectLoopUniforms()? ;-)

Right now, I just run instcombine after loop vectorization to clean up those unnecessary stores (and test cases make sure there's only one store left). Looks like there are other places in LV which relies on InstCombine as the clean up pass, so it may not be that bad after all? Thoughts?

Yeah, this is a bit embarrassing, but currently invariant loads also get replicated (and cleaned up later), despite trying to avoid doing so by recording IsUniform in VPReplicateRecipe. In general, if it's simpler and more consistent to generate code in a common template and potentially cleanup later, should be ok provided the cost model accounts for it accurately and cleanup is guaranteed, as checked by tests. BTW, LV already has an internal cse(). But in this case, VPlan should reflect the final outcome better, i.e., with a correct IsUniform. This should be taken care of, possibly by a separate patch.

include/llvm/Analysis/LoopAccessAnalysis.h
638 ↗	(On Diff #161525)	Yes, in/variant stores to an invariant address may carry cross-iteration dependencies with other loads/store, which could potentially be checked at runtime similar to 'regular' stores. LV supports reductions/inductions if carried by temporaries only, rather than via memory. Such cases should indeed be LICM'd before vectorization - sinking unconditional stores down to a dominated "middle" block, where it's dereferencable and known to have executed at least once.
568 ↗	(On Diff #161787)	Update above comment as well: "non-vectorizable stores" >> "of variant values"
lib/Analysis/LoopAccessAnalysis.cpp
1871 ↗	(On Diff #161525)	Ahh, but a phi of invariant values is invariant iff the compares that decide which predecessor will reach the phi, are also invariant. In `inv_val_store_to_inv_address_conditional_inv` this holds because there `%cmp` determines which predecessor it'll be, and `%cmp` is invariant. In general Divergence Analysis is designed to provide this answer, as in D50433's `isUniform()`.
lib/Transforms/Vectorize/LoopVectorize.cpp
5880 ↗	(On Diff #161525)	very good
5766 ↗	(On Diff #161787)	Should be consistent and use the same `isLoopInvariantStoreValue()` noted above.
5870 ↗	(On Diff #161787)	We expect here that `isa<LoadInst>(&I) \|\| isa<StoreInst>(&I)` (as `memoryInstructionCanBeWidened()` will assert below) having checked `getLoadStorePointerOperand(&I)` above.

anna marked 3 inline comments as done. Aug 23 2018, 8:36 AM

anna added inline comments.

lib/Analysis/LoopAccessAnalysis.cpp
1871 ↗	(On Diff #161525)	yes, that's right. Note that this patch handles phis of invariant values based on either an invariant condition or a variant condition (see `inv_val_store_to_inv_address_conditional_diff_values_ic` where the phi result is based on a varying condition). The improved codegen and cost model handling is for predicated stores, where the block containing the invariant store is to be predicated. Today, we just handle this as a "predicated store" cost and generate the code gen accordingly.

In D50665#1209958, @Ayal wrote:

In D50665#1208026, @anna wrote:

In D50665#1206780, @Ayal wrote:

In D50665#1200509, @anna wrote:

...

Yes, the stores are scalarized. Identical replicas left as-is. Either passes such as load elimination can remove it, or we can clean it up in LV itself.

- by revisiting LoopVectorizationCostModel::collectLoopUniforms()? ;-)

Right now, I just run instcombine after loop vectorization to clean up those unnecessary stores (and test cases make sure there's only one store left). Looks like there are other places in LV which relies on InstCombine as the clean up pass, so it may not be that bad after all? Thoughts?

Yeah, this is a bit embarrassing, but currently invariant loads also get replicated (and cleaned up later), despite trying to avoid doing so by recording IsUniform in VPReplicateRecipe. In general, if it's simpler and more consistent to generate code in a common template and potentially cleanup later, should be ok provided the cost model accounts for it accurately and cleanup is guaranteed, as checked by tests. BTW, LV already has an internal cse(). But in this case, VPlan should reflect the final outcome better, i.e., with a correct IsUniform. This should be taken care of, possibly by a separate patch.

I see. thanks for the clarification. So, for now, I'll leave the stores in the IR just like we're doing for the loads and add a "TODO" for both.

anna mentioned this in D50925: [LICM] Hoist stores of invariant values to invariant addresses out of loops. Aug 23 2018, 8:54 AM

address review comments (NFC wrt previous diff). Added one test for varying value stored into invariant address.

Harbormaster completed remote builds in B21836: Diff 162206. Aug 23 2018, 9:39 AM

Added TODOs for better code gen of predicated uniform store and removing redundant loads and stores left behind during
scalarization of these uniform loads and stores.

Harbormaster completed remote builds in B21837: Diff 162212. Aug 23 2018, 9:59 AM

Ayal added inline comments. Aug 23 2018, 12:08 PM

lib/Analysis/LoopAccessAnalysis.cpp
1871 ↗	(On Diff #161525)	So the suggested `isLoopInvariantStoreValue` name is incorrect, as the store value may be variant. What's special about these variant values - why not handle any store value? Yes, conditional stores to an invariant address will end up scalarized and predicated, i.e., with a branch-and-store per lane, which is quite inefficient. A masked scatter may work better there, until optimized by a single branch-and-store if any lane is masked-on (invariant stored value) or single branch-and-store of last masked-on lane (in/variant stored value).
lib/Transforms/Vectorize/LoopVectorize.cpp
5766 ↗	(On Diff #161787)	This is still inconsistent with the OR-operands-are-invariant above.

anna added inline comments. Aug 23 2018, 12:32 PM

lib/Analysis/LoopAccessAnalysis.cpp
1871 ↗	(On Diff #161525)	I am currently adding the support for any variant stores to invariant address. The unsafe cross iteration dependencies are identified through `LAA: unsafe dependent memory operations in loop`. This was identified without any changes required from my side to the LAA memory conflict detection. However, I'm not sure if LAA handles all cases exhaustively. The reason I started with this sub-patch is that when the stored value is not a varying memory access from within the loop (that's what this blob of code is really trying to do - see `isLoopInvariant` and `hasLoopInvariantOperands`), we don't need to reason about whether LAA handles all memory conflict detection. When the stored value can be any variant value, we need to make sure that the LAA pass handles all the memory conflicts correctly or update LAA if that isn't the case.
lib/Transforms/Vectorize/LoopVectorize.cpp
5766 ↗	(On Diff #161787)	will update.

anna added inline comments. Aug 23 2018, 1:27 PM

lib/Transforms/Vectorize/LoopVectorize.cpp

5766 ↗	(On Diff #161787)	Actually, this is correct. We don't need to update it to the "incorrectly named" lambda above.

We need to do an extract if the value is not invariant. Example case:

for.body:                                         ; preds = %latch, %entry
  %i = phi i64 [ %i.next, %latch ], [ 0, %entry ]
  %tmp1 = getelementptr inbounds i32, i32* %b, i64 %i
  %tmp2 = load i32, i32* %tmp1, align 8
  %varying_cmp = icmp eq i32 %tmp2, %k
  store i32 %ntrunc, i32* %tmp1
  br i1 %varying_cmp, label %cond_store, label %cond_store_k

cond_store:
  br label %latch

cond_store_k:
  br label %latch

latch:
  %storeval = phi i32 [ %ntrunc, %cond_store ], [ %k, %cond_store_k ]
  store i32 %storeval, i32* %a ; uniform store

storeval's operands are invariant, but the value being chosen in each iteration of the loop varies based on %varying_cmp. In this case, we need an extract and then the scalar store. That's exactly what we do as well.
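The reason an extract is needed can be modelled in a few lines of C++. This is a sketch, not vectorizer output: the whole input `b` is treated as a single vector iteration, the store back into `b` from the IR is omitted because it does not affect the chosen value, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference: the uniform store runs every iteration, so after the loop
// the location holds the value selected in the last iteration.
int scalarFinalValue(const std::vector<int> &b, int k, int ntrunc) {
  int a = 0;
  for (int elem : b)
    a = (elem == k) ? ntrunc : k; // %storeval phi followed by the store to %a
  return a;
}

// One vector iteration covering all of b: compute storeval per lane, then
// extract the last lane and perform a single scalar store.
int extractLastLane(const std::vector<int> &b, int k, int ntrunc) {
  assert(!b.empty() && "need at least one lane");
  std::vector<int> storeval(b.size());
  for (std::size_t lane = 0; lane < b.size(); ++lane)
    storeval[lane] = (b[lane] == k) ? ntrunc : k;
  return storeval.back();
}

// Incorrect shortcut: treating storeval as uniform and storing lane 0.
int storeLaneZero(const std::vector<int> &b, int k, int ntrunc) {
  assert(!b.empty() && "need at least one lane");
  return (b.front() == k) ? ntrunc : k;
}
```

With `b = {5, 1, 2, 3}`, `k = 5` and `ntrunc = 9`, the scalar loop and the last-lane extract both leave 5 in the location, while the lane-0 shortcut would store 9.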

okay, to keep this patch true to the original intent and commit message: I'm going to change it to handle just the store of invariant values to invariant addresses (i.e. no support for OR-operands-are-invariant). It will be admittedly a more conservative patch. The ORE message will also reflect correctly the "variant stores to invariant addresses".

Also, the more general patch which is in progress is to handle the store of any (in/variant) value into invariant address. It requires handling of a UB case: when user incorrectly annotates a loop which has memory conflicts as parallel.loop and the vectorizer vectorizes the loop with store to uniform address (but the loop has a memory conflict). There was a bug fixed a while back: https://bugs.llvm.org/show_bug.cgi?id=15794#c4
As part of this more general patch, the ORE about uniform stores will also be removed, since we don't seem to need it (or we can keep the original code around if needed).

One more interesting thing I noticed while adding predicated invariant stores to X86 (for -mcpu=skylake-avx512): it supports masked scatter for non-uniform stores.
But we need to add support for uniform stores along with this patch. Today, it just generates incorrect code (no predication whatsoever).
For other architectures that do not have these masked intrinsics, we just generate the predicated store by doing an extract and branch on each lane (correct but inefficient and will be avoided unless -force-vector-width=X).

see comment above for masked scatter support.

In D50665#1212597, @anna wrote:

One more interesting thing I noticed while adding predicated invariant stores to X86 (for -mcpu=skylake-avx512): it supports masked scatter for non-uniform stores.
But we need to add support for uniform stores along with this patch. Today, it just generates incorrect code (no predication whatsoever).
For other architectures that do not have these masked intrinsics, we just generate the predicated store by doing an extract and branch on each lane (correct but inefficient and will be avoided unless -force-vector-width=X).

In general, self output dependence is fine to vectorize (whether the store address is uniform or random), as long as (masked) scatter (or scatter emulation) happens from lower elements to higher elements. Intel's scatter instruction is implemented in that way, and so is CG Prepare's serialization of masked scatter intrinsic. When we check for TTI based availability/cost, we need to ensure that the HW scatter support satisfies this ordering requirement since some scatter implementations may not.

In D50665#1212637, @hsaito wrote:

In D50665#1212597, @anna wrote:

One more interesting thing I noticed while adding predicated invariant stores to X86 (for -mcpu=skylake-avx512): it supports masked scatter for non-uniform stores.
But we need to add support for uniform stores along with this patch. Today, it just generates incorrect code (no predication whatsoever).
For other architectures that do not have these masked intrinsics, we just generate the predicated store by doing an extract and branch on each lane (correct but inefficient and will be avoided unless -force-vector-width=X).

In general, self output dependence is fine to vectorize (whether the store address is uniform or random), as long as (masked) scatter (or scatter emulation) happens from lower elements to higher elements.

I don't think the above comment matters for uniform addresses because a uniform address is invariant. This is what the langref states for scatter intrinsic (https://llvm.org/docs/LangRef.html#id1792):

. The data stored in memory is a vector of any integer, floating-point or pointer data type. Each vector element is stored in an arbitrary memory address. Scatter with overlapping addresses is guaranteed to be ordered from least-significant to most-significant element.

The scatter address is not overlapping for the uniform address. It is the exact same address. This is the code that gets generated for uniform stores on skylake with AVX-512 support once I fixed the bug in this patch (the scatter location is the same address and the stored value is also the same, and the mask is the vector of booleans):
pseudo code:

if (b[i] == k)
  a = ntrunc; // uniform store based on the condition above

IR generated:

vector.ph:
  %broadcast.splatinsert5 = insertelement <16 x i32> undef, i32 %k, i32 0
  %broadcast.splat6 = shufflevector <16 x i32> %broadcast.splatinsert5, <16 x i32> undef, <16 x i32> zeroinitializer ; vector splat of k
  %broadcast.splatinsert9 = insertelement <16 x i32*> undef, i32* %a, i32 0
  %broadcast.splat10 = shufflevector <16 x i32*> %broadcast.splatinsert9, <16 x i32*> undef, <16 x i32> zeroinitializer ; vector splat of i32* a

vector.body:
  %2 = getelementptr inbounds i32, i32* %b, i64 %index
  %3 = bitcast i32* %2 to <16 x i32>*
  %wide.load = load <16 x i32>, <16 x i32>* %3, align 8
  %4 = icmp eq <16 x i32> %wide.load, %broadcast.splat6
  ; scatter storing the same element into the same address (a), depending on the same condition b[i] == k
  call void @llvm.masked.scatter.v16i32.v16p0i32(<16 x i32> %broadcast.splat8, <16 x i32*> %broadcast.splat10, i32 4, <16 x i1> %4)

For other architectures that do not have these masked intrinsics, we just generate the predicated store by doing an extract and branch on each lane (correct but inefficient and will be avoided unless -force-vector-width=X).

In general, self output dependence is fine to vectorize (whether the store address is uniform or random), as long as (masked) scatter (or scatter emulation) happens from lower elements to higher elements.

I don't think the above comment matters for uniform addresses because a uniform address is invariant.

Only if you are storing uniform value.

This is what the langref states for scatter intrinsic (https://llvm.org/docs/LangRef.html#id1792):

. The data stored in memory is a vector of any integer, floating-point or pointer data type. Each vector element is stored in an arbitrary memory address. Scatter with overlapping addresses is guaranteed to be ordered from least-significant to most-significant element.

Thanks for reminding me that the intrinsic is defined with the ordering requirement.
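The ordering rule quoted from the LangRef can be modelled in C++ to show what it implies for a uniform address. The sketch below is an emulation under stated assumptions, not the intrinsic or its CodeGenPrepare lowering: `VF`, `maskedScatter` and `storeLastActiveLane` are hypothetical names, and lanes are written strictly from lane 0 upward.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t VF = 4; // hypothetical vector width

// Emulates a masked scatter: enabled lanes are written in order from the
// lowest lane to the highest, so later lanes overwrite earlier ones.
void maskedScatter(const std::array<int, VF> &values,
                   const std::array<int *, VF> &ptrs,
                   const std::array<bool, VF> &mask) {
  for (std::size_t lane = 0; lane < VF; ++lane) {
    if (mask[lane]) {
      assert(ptrs[lane] != nullptr && "enabled lane needs a valid address");
      *ptrs[lane] = values[lane];
    }
  }
}

// For a uniform address the ordered scatter is equivalent to one scalar
// store of the last masked-on lane (and no store if the mask is all false).
void storeLastActiveLane(const std::array<int, VF> &values, int *addr,
                         const std::array<bool, VF> &mask) {
  assert(addr != nullptr && "address must be valid");
  for (std::size_t lane = VF; lane-- > 0;) {
    if (mask[lane]) {
      *addr = values[lane];
      return;
    }
  }
}
```

When the stored value is also uniform, every enabled lane writes the same value and the lane order no longer matters; the order only becomes observable for variant values, which is the case the "Only if you are storing uniform value" remark points at.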

We should also consider doing this, depending on the cost of branch versus masked scatter. For the targets w/o masked scatter, this should be better than masked scatter emulation.

  %5 = bitcast <16 x i1> %4 to i16
  %6 = icmp eq i16 %5, 0
  br i1 %6, label %skip, label %fall

fall:
  store i32 %ntrunc, i32* %a
  br label %skip

skip:

In D50665#1212899, @hsaito wrote:

We should also consider doing this, depending on the cost of branch versus masked scatter. For the targets w/o masked scatter, this should be better than masked scatter emulation.

  %5 = bitcast <16 x i1> %4 to i16
  %6 = icmp eq i16 %5, 0
  br i1 %6, label %skip, label %fall

fall:
  store i32 %ntrunc, i32* %a
  br label %skip

skip:

Yes, that is the improved codegen stated as TODO in the costmodel. Today both the costmodel and the code gen will identify it as a normal predicated store: series of branches and stores. Also, we need to differentiate these 2 cases:

if (b[i] == k)
  a = ntrunc;

versus

if (b[i] == k)
  a = ntrunc;
else
  a = m;

The second example should be converted into a vector-select based on b[i] == k and the last element will be extracted out of the vector select and stored into a.
However, if for some reason it is not converted into a select and is just left as 2 predicated stores, it is incorrect to use the same code transformation as we'd use for the first example. For the first example, we check whether all values in the conditional are false, and if so we skip the store. In the second case, we always need to store a value, but that value is decided by the last element of the conditional. These are just 2 different forms of predicated stores.
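A small C++ model makes the difference between the two forms concrete. It is a sketch under stated assumptions: the condition `b[i] == k` is precomputed into a `std::vector<bool>` covering one vector iteration, and all function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Scalar reference for: if (cond[i]) a = ntrunc; else a = m;
int scalarIfElse(const std::vector<bool> &cond, int ntrunc, int m, int a) {
  for (bool c : cond)
    a = c ? ntrunc : m;
  return a;
}

// Incorrect: lowers each of the two predicated stores with the
// "skip the store if the whole mask is false" rule of the first form.
int anyLaneIfElse(const std::vector<bool> &cond, int ntrunc, int m, int a) {
  bool anyTrue = false;
  bool anyFalse = false;
  for (bool c : cond) {
    anyTrue = anyTrue || c;
    anyFalse = anyFalse || !c;
  }
  if (anyTrue)
    a = ntrunc; // store guarded by the mask
  if (anyFalse)
    a = m;      // store guarded by the inverted mask
  return a;
}

// Correct: the last lane of the condition decides which value is stored.
int lastLaneIfElse(const std::vector<bool> &cond, int ntrunc, int m) {
  assert(!cond.empty() && "need at least one lane");
  return cond.back() ? ntrunc : m;
}
```

For the mask `{false, false, false, true}` the scalar loop ends with `ntrunc`, as does the last-lane version, whereas the any-lane lowering ends with `m` because the second guarded store overwrites the first.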

Yes, that is the improved codegen stated as TODO in the costmodel.

Aha. OK. Thanks for the clarification.

In D50665#1212872, @anna wrote:
This is what the langref states for scatter intrinsic (https://llvm.org/docs/LangRef.html#id1792):
. The data stored in memory is a vector of any integer, floating-point or pointer data type. Each vector element is stored in an arbitrary memory address. Scatter with overlapping addresses is guaranteed to be ordered from least-significant to most-significant element.

Yes, this was intentional, precisely to support vectorization of (possibly) self-overwriting stores.

...
This is the code that gets generated for uniform stores on skylake with AVX-512 support once I fixed the bug in this patch
...

LGTM.
Indeed, care must be taken to avoid using more than one masked scatter to the same invariant address; but LAA should flag such non-self cross-iteration dependencies.

include/llvm/Analysis/LoopAccessAnalysis.h
570 ↗	(On Diff #162212)	Could rename `VariantStoreToLoopInvariantAddress` to `HasVariantStoreToLoopInvariantAddress`.
lib/Analysis/LoopAccessAnalysis.cpp
1871 ↗	(On Diff #161525)	It is conceivable that stores of invariant values to invariant addresses can participate in a subset of unsafe scenarios, which may be easier for LAA to detect, and thus start by treating only stores of invariant values to invariant addresses. But storing a variant phi whose "dominating" compares are not all invariant, could conceptually produce arbitrary variant values and dependencies, despite having invariant values for all other operands of the phi; e.g., 0 and -1. Presumably, this case does not differ, from LAA perspective, from stores of any variant value to invariant address.
lib/Transforms/Vectorize/LoopVectorize.cpp
5766 ↗	(On Diff #161787)	Agreed. Misled by the erroneous isLoopInvariantStoreValue() name.

anna mentioned this in D51313: [LV] Fix code gen for conditionally executed uniform loads. Aug 27 2018, 9:19 AM

added test for conditional uniform store for AVX512. Rebased over fix in D51313.

Harbormaster completed remote builds in B22092: Diff 163328. Aug 30 2018, 7:51 AM

This patch now only vectorizes invariant values stored into invariant addresses. It also correctly handles conditionally executed stores (fixed bug for scatter code generation in AVX512).

rebased over D51313.

ping

Harbormaster completed remote builds in B22422: Diff 164667. Sep 10 2018, 7:18 AM

Best allow only a single store to an invariant address for now; until we're sure the last one to store is always identified correctly.

include/llvm/Analysis/LoopAccessAnalysis.h
568 ↗	(On Diff #164667)	The comment

/// Checks existence of stores to invariant address inside loop.
/// If such stores exist, checks if those are stores of variant values.

can be updated and simplified into something like

/// If the loop has any store of a variant value to an invariant address,
/// then return true, else return false.
lib/Analysis/LoopAccessAnalysis.cpp
1869 ↗	(On Diff #164667)	How about `isUniform(Ptr) && !isUniform(ST->getValueOperand())` ? Relying more consistently on SCEV to determine invariance of both address and stored value. Is there a reason for treating stored value more conservatively, checking its invariance by asking if it's outside the loop?
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
780 ↗	(On Diff #164667)	update the message as well: "write of variant value to a loop invariant address ..."
lib/Transforms/Vectorize/LoopVectorize.cpp
5890 ↗	(On Diff #164667)	Complementing the consistent use of isUniform rather than isLoopInvariant: `bool isLoopInvariantStoreValue = Legal->isUniform(SI->getValueOperand());` ? , similar to the way the address is checked to be uniform before calling this method below.
6008 ↗	(On Diff #164667)	Comment can be simplified to something like

// TODO: Avoid replicating loads and stores instead of
// relying on instcombine to remove them.
test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll
16 ↗	(On Diff #164667)	Have one space instead of two between i32 and %ntrunc on the check-not'd store. Easier to see that this checks for a single copy of the store, i.e., that instcombine eliminated all redundant copies. May want to comment what this test is designed to check.
131 ↗	(On Diff #164667)	inv_val_load_to?
test/Transforms/LoopVectorize/invariant-store-vectorization.ll
11 ↗	(On Diff #164667)	"that check whether" >> "check that" ("whether" usually comes with an "or not")
80 ↗	(On Diff #164667)	"as identifying these" >> "identify them" Do we check what the cost model identifies?
132 ↗	(On Diff #164667)	Hmm, multiple stores to the same invariant address did not trigger LAI memory dependence checks(?) This may generate wrong code if the conditional scalarized stores are emitted in the wrong order, or if a pair of masked scatters are used.
151 ↗	(On Diff #164667)	good to continue CHECKing that EE1 is used in a branch that guards a store of %ntrunc to %a.
182 ↗	(On Diff #164667)	Now the store/s is/are no longer of invariant value/s.
183 ↗	(On Diff #164667)	.. once we support vectorizing stores of variant values to invariant addresses
219 ↗	(On Diff #164667)	.. efficiently once divergence analysis identifies storeval as uniform
252 ↗	(On Diff #164667)	"even though it's" >> "once we support"
test/Transforms/LoopVectorize/pr31190.ll
33 ↗	(On Diff #164667)	CHECK vectorized code emitted, or debug info stating it can be vectorized?
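For readers without the test file at hand, the loop shape exercised by inv_val_store_to_inv_address_with_reduction can be sketched in plain C++ roughly as follows. This is only an illustration under stated assumptions: the function names, the fixed VF of 4, and the hand-written "widened" variant are hypothetical and are not what the vectorizer emits. It shows why a store of an invariant value to an invariant address can be kept as one scalar store per vector iteration without changing the result, provided `a` does not alias `b` (which the runtime memcheck in the test guards).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference: a reduction over b plus a store of a loop-invariant
// value to the loop-invariant address a.
std::uint32_t scalarInvariantStore(std::uint32_t *a,
                                   const std::vector<std::uint32_t> &b) {
  assert(a != nullptr);
  const std::uint32_t inv = static_cast<std::uint32_t>(b.size());
  std::uint32_t sum = 0;
  for (std::size_t i = 0; i < b.size(); ++i) {
    sum += b[i];
    *a = inv; // same value, same address, every iteration
  }
  return sum;
}

// Hand-widened sketch (hypothetical VF of 4): the reduction uses one partial
// sum per lane, and the invariant store is emitted once per vector iteration.
std::uint32_t widenedInvariantStore(std::uint32_t *a,
                                    const std::vector<std::uint32_t> &b) {
  assert(a != nullptr);
  constexpr std::size_t VF = 4;
  const std::uint32_t inv = static_cast<std::uint32_t>(b.size());
  std::array<std::uint32_t, VF> partial{};
  std::size_t i = 0;
  for (; i + VF <= b.size(); i += VF) {
    for (std::size_t lane = 0; lane < VF; ++lane)
      partial[lane] += b[i + lane];
    *a = inv; // a single scalar store stands in for VF identical stores
  }
  std::uint32_t sum = 0;
  for (std::uint32_t p : partial)
    sum += p;
  for (; i < b.size(); ++i) { // scalar remainder loop
    sum += b[i];
    *a = inv;
  }
  return sum;
}
```

Both functions return the same sum and leave the same value in `*a` for any input where `a` points outside `b`.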

Hi Ayal, thanks for your detailed review!

In D50665#1231754, @Ayal wrote:

Best allow only a single store to an invariant address for now; until we're sure the last one to store is always identified correctly.

I've updated the patch to restrict to this case for now (diff coming up soon). Generally, if we have multiple stores to an invariant address, it might be canonicalized by InstCombine. So, this may not be as inhibiting as it sounds. Keeping this restriction and allowing "variant stores to invariant addresses" seems like a logical next step once this lands.
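The ordering concern behind this restriction can be illustrated with a small standalone C++ sketch. Everything here is hypothetical (function names, the two-pass "widening", the chosen VF); it is not the code the vectorizer generates, only a model of what could go wrong if two conditional stores to the same invariant address were each widened independently and emitted one after the other.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference: each iteration stores one of two invariant values to the
// same invariant location; the last iteration decides the final value.
int scalarTwoStores(const std::vector<int> &b, int k, int x, int y) {
  int a = 0;
  for (std::size_t i = 0; i < b.size(); ++i) {
    if (b[i] == k)
      a = x;
    else
      a = y;
  }
  return a;
}

// Deliberately naive widening: per vector iteration, all lanes of the first
// conditional store are performed before any lane of the second one. This
// loses the per-iteration interleaving of the two stores.
int naiveWidenedTwoStores(const std::vector<int> &b, int k, int x, int y,
                          std::size_t vf) {
  assert(vf > 0 && "vector factor must be positive");
  int a = 0;
  for (std::size_t base = 0; base < b.size(); base += vf) {
    const std::size_t end = std::min(base + vf, b.size());
    for (std::size_t i = base; i < end; ++i) // first masked store
      if (b[i] == k)
        a = x;
    for (std::size_t i = base; i < end; ++i) // second masked store
      if (b[i] != k)
        a = y;
  }
  return a;
}
```

With `b = {7, 1, 1, 7}`, `k = 7`, `x = 10`, `y = 20` and a VF of 4, the scalar loop ends with 10 while the naive widening ends with 20, which is the kind of miscompile the single-store restriction avoids.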

lib/Analysis/LoopAccessAnalysis.cpp
1869 ↗	(On Diff #164667)	Nothing specific. This works as well, so I've changed it. As a separate change, we'll need to improve `isUniform`, because it currently treats uniform FP values as non-uniform (since FP is non-scevable).
test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll
131 ↗	(On Diff #164667)	updated name.
test/Transforms/LoopVectorize/invariant-store-vectorization.ll
80 ↗	(On Diff #164667)	Since we don't have debug statements for what the cost model identifies, I've updated the above comment.
132 ↗	(On Diff #164667)	good point - as stated in comment earlier, I will restrict to one store to invariant address for now.
219 ↗	(On Diff #164667)	once we relax the check of variant/invariant value being stored, it does not matter if we correctly identify if it is variant or invariant. So, I think divergence analysis is not required.

addressed review comments.

Harbormaster completed remote builds in B22814: Diff 166025. Sep 18 2018, 1:40 PM

ping

Thanks for taking care of everything, this LGTM now, added only a few minor optional comments.

lib/Analysis/LoopAccessAnalysis.cpp
1883 ↗	(On Diff #166025)	Maybe clearer to do if (isUniform(Ptr)) { // Consider multiple stores to the same uniform address as a store of a variant value. bool MultipleStoresToUniformPtr = UniformStores.insert(Ptr).second; HasVariantStoreToLoopInvariantAddress |= (!isUniform(ST->getValueOperand()) || MultipleStoresToUniformPtr); } Note that supporting a single store of a variant value to an invariant address is easier than supporting multiple (conditional) stores of invariant values to an invariant address, as discussed. So the two conditions should probably be separated when the patch taking care of the former is introduced.
lib/Transforms/Vectorize/LoopVectorize.cpp
1180 ↗	(On Diff #166025)	The ": extract of last element" part is for future use, when stores of variant values to invariant addresses are supported, right? Best leave this part to that future patch, or add a TODO to test this extra cost then.
5422 ↗	(On Diff #166025)	No need to add these enclosing curly brackets.
test/Transforms/LoopVectorize/invariant-store-vectorization.ll
219 ↗	(On Diff #164667)	ok, works both ways - once we leverage divergence analysis we'll be able to handle such a store of uniform/invariant value, w/o needing relaxed support for stores of variant values.
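The cost formula discussed for getUniformMemOpCost (load: scalar load + broadcast; store: scalar store, plus an extract of the last element only when the stored value is loop variant) can be modelled with a small standalone sketch. The struct, its field names, and the numeric costs are hypothetical stand-ins for what the target cost hooks would return; this is not the LLVM API.

```cpp
#include <cassert>

// Hypothetical per-target costs; in the patch these come from TTI queries.
struct HypotheticalCosts {
  unsigned AddressComputation;
  unsigned ScalarLoad;
  unsigned ScalarStore;
  unsigned Broadcast;
  unsigned ExtractLastElement;
};

// Cost of a memory operation on a uniform (loop-invariant) pointer.
unsigned uniformMemOpCost(bool IsLoad, bool StoredValueIsInvariant,
                          const HypotheticalCosts &C) {
  if (IsLoad)
    return C.AddressComputation + C.ScalarLoad + C.Broadcast;
  // An invariant stored value needs no extract; a variant one would need the
  // last vector element extracted before the scalar store.
  const unsigned Extract = StoredValueIsInvariant ? 0 : C.ExtractLastElement;
  return C.AddressComputation + C.ScalarStore + Extract;
}
```

Since this patch only admits stores of invariant values, the extract term is not exercised yet, which is why a TODO was suggested for it.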

This revision is now accepted and ready to land. Sep 24 2018, 2:52 PM

anna marked 3 inline comments as done. Sep 25 2018, 8:30 AM

anna added inline comments.

lib/Analysis/LoopAccessAnalysis.cpp
1883 ↗	(On Diff #166025)	done. Should be `bool MultipleStoresToUniformPtr = !UniformStores.insert(Ptr).second;`
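The negation matters because `insert(...).second` is true when the element was newly inserted and false when it was already present. A minimal standalone sketch of the committed logic, assuming `std::set` as a stand-in for the analysis' ValueSet and a hypothetical `Store` record in place of StoreInst plus the isUniform queries:

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical flattened view of a store: its address and whether the
// address and the stored value are uniform (loop invariant).
struct Store {
  const void *Ptr;
  bool PtrIsUniform;
  bool ValueIsUniform;
};

bool hasVariantStoreToInvariantAddress(const std::vector<Store> &Stores) {
  std::set<const void *> UniformStores;
  bool Found = false;
  for (const Store &ST : Stores) {
    assert(ST.Ptr != nullptr);
    if (!ST.PtrIsUniform)
      continue;
    // .second is false when Ptr was already in the set, i.e. this is a
    // second store to the same uniform address; hence the negation.
    const bool MultipleStoresToUniformPtr =
        !UniformStores.insert(ST.Ptr).second;
    Found |= (!ST.ValueIsUniform || MultipleStoresToUniformPtr);
  }
  return Found;
}
```

Without the `!`, a single store of an invariant value would be flagged as a variant store, and a repeated store to the same address would not.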

Closed by commit rL343028: [LV][LAA] Vectorize loop invariant values stored into loop invariant address (authored by annat). Sep 25 2018, 1:58 PM

This revision was automatically updated to reflect the committed changes.

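Revision Contents

Path	Size
llvm/trunk/include/llvm/Analysis/LoopAccessAnalysis.h	14 lines
llvm/trunk/lib/Analysis/LoopAccessAnalysis.cpp	21 lines
llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp	5 lines
llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp	40 lines
llvm/trunk/test/Analysis/LoopAccessAnalysis/memcheck-wrapping-pointers.ll	2 lines
llvm/trunk/test/Analysis/LoopAccessAnalysis/store-to-invariant-check1.ll	8 lines
llvm/trunk/test/Analysis/LoopAccessAnalysis/store-to-invariant-check2.ll	4 lines
llvm/trunk/test/Analysis/LoopAccessAnalysis/store-to-invariant-check3.ll	2 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll	132 lines
llvm/trunk/test/Transforms/LoopVectorize/invariant-store-vectorization.ll	260 lines
llvm/trunk/test/Transforms/LoopVectorize/pr31190.ll	5 lines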

Diff 166987

llvm/trunk/include/llvm/Analysis/LoopAccessAnalysis.h

public:
const ValueToValueMap &getSymbolicStrides() const { return SymbolicStrides; }		const ValueToValueMap &getSymbolicStrides() const { return SymbolicStrides; }

/// Pointer has a symbolic stride.		/// Pointer has a symbolic stride.
bool hasStride(Value *V) const { return StrideSet.count(V); }		bool hasStride(Value *V) const { return StrideSet.count(V); }

/// Print the information about the memory accesses in the loop.		/// Print the information about the memory accesses in the loop.
void print(raw_ostream &OS, unsigned Depth = 0) const;		void print(raw_ostream &OS, unsigned Depth = 0) const;

/// Checks existence of store to invariant address inside loop.		/// If the loop has any store of a variant value to an invariant address, then
/// If the loop has any store to invariant address, then it returns true,		/// return true, else return false.
/// else returns false.		bool hasVariantStoreToLoopInvariantAddress() const {
bool hasStoreToLoopInvariantAddress() const {		return HasVariantStoreToLoopInvariantAddress;
return StoreToLoopInvariantAddress;
}		}

/// Used to add runtime SCEV checks. Simplifies SCEV expressions and converts		/// Used to add runtime SCEV checks. Simplifies SCEV expressions and converts
/// them to a more usable form. All SCEV expressions during the analysis		/// them to a more usable form. All SCEV expressions during the analysis
/// should be re-written (and therefore simplified) according to PSE.		/// should be re-written (and therefore simplified) according to PSE.
/// A user of LoopAccessAnalysis will need to emit the runtime checks		/// A user of LoopAccessAnalysis will need to emit the runtime checks
/// associated with this predicate.		/// associated with this predicate.
const PredicatedScalarEvolution &getPSE() const { return *PSE; }		const PredicatedScalarEvolution &getPSE() const { return *PSE; }
private:
unsigned NumLoads;		unsigned NumLoads;
unsigned NumStores;		unsigned NumStores;

uint64_t MaxSafeDepDistBytes;		uint64_t MaxSafeDepDistBytes;

/// Cache the result of analyzeLoop.		/// Cache the result of analyzeLoop.
bool CanVecMem;		bool CanVecMem;

/// Indicator for storing to uniform addresses.		/// Indicator that there is a store of a variant value to a uniform address.
/// If a loop has write to a loop invariant address then it should be true.		bool HasVariantStoreToLoopInvariantAddress;
bool StoreToLoopInvariantAddress;

/// The diagnostics report generated for the analysis. E.g. why we		/// The diagnostics report generated for the analysis. E.g. why we
/// couldn't analyze the loop.		/// couldn't analyze the loop.
std::unique_ptr<OptimizationRemarkAnalysis> Report;		std::unique_ptr<OptimizationRemarkAnalysis> Report;

/// If an access has a symbolic strides, this maps the pointer value to		/// If an access has a symbolic strides, this maps the pointer value to
/// the stride symbol.		/// the stride symbol.
ValueToValueMap SymbolicStrides;		ValueToValueMap SymbolicStrides;

llvm/trunk/lib/Analysis/LoopAccessAnalysis.cpp

void LoopAccessInfo::analyzeLoop(AliasAnalysis *AA, LoopInfo *LI,

// Holds the analyzed pointers. We don't want to call GetUnderlyingObjects		// Holds the analyzed pointers. We don't want to call GetUnderlyingObjects
// multiple times on the same object. If the ptr is accessed twice, once		// multiple times on the same object. If the ptr is accessed twice, once
// for read and once for write, it will only appear once (on the write		// for read and once for write, it will only appear once (on the write
// list). This is okay, since we are going to check for conflicts between		// list). This is okay, since we are going to check for conflicts between
// writes and between reads and writes, but not between reads and reads.		// writes and between reads and writes, but not between reads and reads.
ValueSet Seen;		ValueSet Seen;

		// Record uniform store addresses to identify if we have multiple stores
		// to the same address.
		ValueSet UniformStores;

for (StoreInst *ST : Stores) {		for (StoreInst *ST : Stores) {
Value *Ptr = ST->getPointerOperand();		Value *Ptr = ST->getPointerOperand();
// Check for store to loop invariant address.
StoreToLoopInvariantAddress |= isUniform(Ptr);		if (isUniform(Ptr)) {
		// Consider multiple stores to the same uniform address as a store of a
		// variant value.
		bool MultipleStoresToUniformPtr = !UniformStores.insert(Ptr).second;
		HasVariantStoreToLoopInvariantAddress |=
		(!isUniform(ST->getValueOperand()) || MultipleStoresToUniformPtr);
		}

// If we did not see this pointer before, insert it to the read-write		// If we did not see this pointer before, insert it to the read-write
// list. At this phase it is only a 'write' list.		// list. At this phase it is only a 'write' list.
if (Seen.insert(Ptr).second) {		if (Seen.insert(Ptr).second) {
++NumReadWrites;		++NumReadWrites;

MemoryLocation Loc = MemoryLocation::get(ST);		MemoryLocation Loc = MemoryLocation::get(ST);
// The TBAA metadata could have a control dependency on the predication		// The TBAA metadata could have a control dependency on the predication
// condition, so we cannot rely on it when determining whether or not we		// condition, so we cannot rely on it when determining whether or not we

LoopAccessInfo::LoopAccessInfo(Loop *L, ScalarEvolution *SE,		LoopAccessInfo::LoopAccessInfo(Loop *L, ScalarEvolution *SE,
const TargetLibraryInfo *TLI, AliasAnalysis *AA,		const TargetLibraryInfo *TLI, AliasAnalysis *AA,
DominatorTree *DT, LoopInfo *LI)		DominatorTree *DT, LoopInfo *LI)
: PSE(llvm::make_unique<PredicatedScalarEvolution>(*SE, *L)),		: PSE(llvm::make_unique<PredicatedScalarEvolution>(*SE, *L)),
PtrRtChecking(llvm::make_unique<RuntimePointerChecking>(SE)),		PtrRtChecking(llvm::make_unique<RuntimePointerChecking>(SE)),
DepChecker(llvm::make_unique<MemoryDepChecker>(*PSE, L)), TheLoop(L),		DepChecker(llvm::make_unique<MemoryDepChecker>(*PSE, L)), TheLoop(L),
NumLoads(0), NumStores(0), MaxSafeDepDistBytes(-1), CanVecMem(false),		NumLoads(0), NumStores(0), MaxSafeDepDistBytes(-1), CanVecMem(false),
StoreToLoopInvariantAddress(false) {		HasVariantStoreToLoopInvariantAddress(false) {
if (canAnalyzeLoop())		if (canAnalyzeLoop())
analyzeLoop(AA, LI, TLI, DT);		analyzeLoop(AA, LI, TLI, DT);
}		}

void LoopAccessInfo::print(raw_ostream &OS, unsigned Depth) const {		void LoopAccessInfo::print(raw_ostream &OS, unsigned Depth) const {
if (CanVecMem) {		if (CanVecMem) {
OS.indent(Depth) << "Memory dependences are safe";		OS.indent(Depth) << "Memory dependences are safe";
if (MaxSafeDepDistBytes != -1ULL)		if (MaxSafeDepDistBytes != -1ULL)
if (auto *Dependences = DepChecker->getDependences()) {
}		}
} else		} else
OS.indent(Depth) << "Too many dependences, not recorded\n";		OS.indent(Depth) << "Too many dependences, not recorded\n";

// List the pair of accesses need run-time checks to prove independence.		// List the pair of accesses need run-time checks to prove independence.
PtrRtChecking->print(OS, Depth);		PtrRtChecking->print(OS, Depth);
OS << "\n";		OS << "\n";

OS.indent(Depth) << "Store to invariant address was "		OS.indent(Depth) << "Variant Store to invariant address was "
<< (StoreToLoopInvariantAddress ? "" : "not ")		<< (HasVariantStoreToLoopInvariantAddress ? "" : "not ")
<< "found in loop.\n";		<< "found in loop.\n";

OS.indent(Depth) << "SCEV assumptions:\n";		OS.indent(Depth) << "SCEV assumptions:\n";
PSE->getUnionPredicate().print(OS, Depth);		PSE->getUnionPredicate().print(OS, Depth);

OS << "\n";		OS << "\n";

OS.indent(Depth) << "Expressions re-written:\n";		OS.indent(Depth) << "Expressions re-written:\n";

llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

if (LAR) {
ORE->emit([&]() {		ORE->emit([&]() {
return OptimizationRemarkAnalysis(Hints->vectorizeAnalysisPassName(),		return OptimizationRemarkAnalysis(Hints->vectorizeAnalysisPassName(),
"loop not vectorized: ", *LAR);		"loop not vectorized: ", *LAR);
});		});
}		}
if (!LAI->canVectorizeMemory())		if (!LAI->canVectorizeMemory())
return false;		return false;

if (LAI->hasStoreToLoopInvariantAddress()) {		if (LAI->hasVariantStoreToLoopInvariantAddress()) {
ORE->emit(createMissedAnalysis("CantVectorizeStoreToLoopInvariantAddress")		ORE->emit(createMissedAnalysis("CantVectorizeStoreToLoopInvariantAddress")
<< "write to a loop invariant address could not be vectorized");		<< "write of variant value to a loop invariant address could not "
		"be vectorized");
LLVM_DEBUG(dbgs() << "LV: We don't allow storing to uniform addresses\n");		LLVM_DEBUG(dbgs() << "LV: We don't allow storing to uniform addresses\n");
return false;		return false;
}		}

Requirements->addRuntimePointerChecks(LAI->getNumRuntimePointerChecks());		Requirements->addRuntimePointerChecks(LAI->getNumRuntimePointerChecks());
PSE.addPredicate(LAI->getPSE().getUnionPredicate());		PSE.addPredicate(LAI->getPSE().getUnionPredicate());

return true;		return true;

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp

private:

/// The cost computation for Gather/Scatter instruction.		/// The cost computation for Gather/Scatter instruction.
unsigned getGatherScatterCost(Instruction *I, unsigned VF);		unsigned getGatherScatterCost(Instruction *I, unsigned VF);

/// The cost computation for widening instruction \p I with consecutive		/// The cost computation for widening instruction \p I with consecutive
/// memory access.		/// memory access.
unsigned getConsecutiveMemOpCost(Instruction *I, unsigned VF);		unsigned getConsecutiveMemOpCost(Instruction *I, unsigned VF);

/// The cost calculation for Load instruction \p I with uniform pointer -		/// The cost calculation for Load/Store instruction \p I with uniform pointer -
/// scalar load + broadcast.		/// Load: scalar load + broadcast.
		/// Store: scalar store + (loop invariant value stored? 0 : extract of last
		/// element)
		/// TODO: Test the extra cost of the extract when loop variant value stored.
unsigned getUniformMemOpCost(Instruction *I, unsigned VF);		unsigned getUniformMemOpCost(Instruction *I, unsigned VF);

/// Returns whether the instruction is a load or store and will be a emitted		/// Returns whether the instruction is a load or store and will be a emitted
/// as a vector operation.		/// as a vector operation.
bool isConsecutiveLoadOrStore(Instruction *I);		bool isConsecutiveLoadOrStore(Instruction *I);

/// Returns true if an artificially high cost for emulated masked memrefs		/// Returns true if an artificially high cost for emulated masked memrefs
/// should be used.		/// should be used.
unsigned LoopVectorizationCostModel::getConsecutiveMemOpCost(Instruction *I,
bool Reverse = ConsecutiveStride < 0;		bool Reverse = ConsecutiveStride < 0;
if (Reverse)		if (Reverse)
Cost += TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0);		Cost += TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0);
return Cost;		return Cost;
}		}

unsigned LoopVectorizationCostModel::getUniformMemOpCost(Instruction *I,		unsigned LoopVectorizationCostModel::getUniformMemOpCost(Instruction *I,
unsigned VF) {		unsigned VF) {
LoadInst *LI = cast<LoadInst>(I);		Type *ValTy = getMemInstValueType(I);
Type *ValTy = LI->getType();
Type *VectorTy = ToVectorTy(ValTy, VF);		Type *VectorTy = ToVectorTy(ValTy, VF);
unsigned Alignment = LI->getAlignment();		unsigned Alignment = getLoadStoreAlignment(I);
unsigned AS = LI->getPointerAddressSpace();		unsigned AS = getLoadStoreAddressSpace(I);
		if (isa<LoadInst>(I)) {
return TTI.getAddressComputationCost(ValTy) +		return TTI.getAddressComputationCost(ValTy) +
TTI.getMemoryOpCost(Instruction::Load, ValTy, Alignment, AS) +		TTI.getMemoryOpCost(Instruction::Load, ValTy, Alignment, AS) +
TTI.getShuffleCost(TargetTransformInfo::SK_Broadcast, VectorTy);		TTI.getShuffleCost(TargetTransformInfo::SK_Broadcast, VectorTy);
}		}
		StoreInst *SI = cast<StoreInst>(I);

		bool isLoopInvariantStoreValue = Legal->isUniform(SI->getValueOperand());
		return TTI.getAddressComputationCost(ValTy) +
		TTI.getMemoryOpCost(Instruction::Store, ValTy, Alignment, AS) +
		(isLoopInvariantStoreValue ? 0 : TTI.getVectorInstrCost(
		Instruction::ExtractElement,
		VectorTy, VF - 1));
		}

unsigned LoopVectorizationCostModel::getGatherScatterCost(Instruction *I,		unsigned LoopVectorizationCostModel::getGatherScatterCost(Instruction *I,
unsigned VF) {		unsigned VF) {
Type *ValTy = getMemInstValueType(I);		Type *ValTy = getMemInstValueType(I);
Type *VectorTy = ToVectorTy(ValTy, VF);		Type *VectorTy = ToVectorTy(ValTy, VF);
unsigned Alignment = getLoadStoreAlignment(I);		unsigned Alignment = getLoadStoreAlignment(I);
Value *Ptr = getLoadStorePointerOperand(I);		Value *Ptr = getLoadStorePointerOperand(I);

void LoopVectorizationCostModel::setCostBasedWideningDecision(unsigned VF) {
NumPredStores = 0;		NumPredStores = 0;
for (BasicBlock *BB : TheLoop->blocks()) {		for (BasicBlock *BB : TheLoop->blocks()) {
// For each instruction in the old loop.		// For each instruction in the old loop.
for (Instruction &I : *BB) {		for (Instruction &I : *BB) {
Value *Ptr = getLoadStorePointerOperand(&I);		Value *Ptr = getLoadStorePointerOperand(&I);
if (!Ptr)		if (!Ptr)
continue;		continue;

		// TODO: We should generate better code and update the cost model for
		// predicated uniform stores. Today they are treated as any other
		// predicated store (see added test cases in
		// invariant-store-vectorization.ll).
if (isa<StoreInst>(&I) && isScalarWithPredication(&I))		if (isa<StoreInst>(&I) && isScalarWithPredication(&I))
NumPredStores++;		NumPredStores++;

if (isa<LoadInst>(&I) && Legal->isUniform(Ptr) &&		if (Legal->isUniform(Ptr) &&
// Conditional loads should be scalarized and predicated.		// Conditional loads and stores should be scalarized and predicated.
// isScalarWithPredication cannot be used here since masked		// isScalarWithPredication cannot be used here since masked
// gather/scatters are not considered scalar with predication.		// gather/scatters are not considered scalar with predication.
!Legal->blockNeedsPredication(I.getParent())) {		!Legal->blockNeedsPredication(I.getParent())) {
// Scalar load + broadcast		// TODO: Avoid replicating loads and stores instead of
		// relying on instcombine to remove them.
		// Load: Scalar load + broadcast
		// Store: Scalar store + isLoopInvariantStoreValue ? 0 : extract
unsigned Cost = getUniformMemOpCost(&I, VF);		unsigned Cost = getUniformMemOpCost(&I, VF);
setWideningDecision(&I, VF, CM_Scalarize, Cost);		setWideningDecision(&I, VF, CM_Scalarize, Cost);
continue;		continue;
}		}

// We assume that widening is the best solution when possible.		// We assume that widening is the best solution when possible.
if (memoryInstructionCanBeWidened(&I, VF)) {		if (memoryInstructionCanBeWidened(&I, VF)) {
unsigned Cost = getConsecutiveMemOpCost(&I, VF);		unsigned Cost = getConsecutiveMemOpCost(&I, VF);

llvm/trunk/test/Analysis/LoopAccessAnalysis/memcheck-wrapping-pointers.ll

	; CHECK-NEXT: %arrayidx4 = getelementptr inbounds i32, i32* %b, i64 %conv11			; CHECK-NEXT: %arrayidx4 = getelementptr inbounds i32, i32* %b, i64 %conv11
	; CHECK-NEXT: Grouped accesses:			; CHECK-NEXT: Grouped accesses:
	; CHECK-NEXT: Group			; CHECK-NEXT: Group
	; CHECK-NEXT: (Low: (4 + %a) High: (4 + (4 * (1 umax %x)) + %a))			; CHECK-NEXT: (Low: (4 + %a) High: (4 + (4 * (1 umax %x)) + %a))
	; CHECK-NEXT: Member: {(4 + %a),+,4}<%for.body>			; CHECK-NEXT: Member: {(4 + %a),+,4}<%for.body>
	; CHECK-NEXT: Group			; CHECK-NEXT: Group
	; CHECK-NEXT: (Low: %b High: ((4 * (1 umax %x)) + %b))			; CHECK-NEXT: (Low: %b High: ((4 * (1 umax %x)) + %b))
	; CHECK-NEXT: Member: {%b,+,4}<%for.body>			; CHECK-NEXT: Member: {%b,+,4}<%for.body>
	; CHECK: Store to invariant address was not found in loop.			; CHECK: Variant Store to invariant address was not found in loop.
	; CHECK-NEXT: SCEV assumptions:			; CHECK-NEXT: SCEV assumptions:
	; CHECK-NEXT: {1,+,1}<%for.body> Added Flags: <nusw>			; CHECK-NEXT: {1,+,1}<%for.body> Added Flags: <nusw>
	; CHECK-NEXT: {0,+,1}<%for.body> Added Flags: <nusw>			; CHECK-NEXT: {0,+,1}<%for.body> Added Flags: <nusw>
	; CHECK: Expressions re-written:			; CHECK: Expressions re-written:
	; CHECK-NEXT: [PSE] %arrayidx = getelementptr inbounds i32, i32* %a, i64 %idxprom:			; CHECK-NEXT: [PSE] %arrayidx = getelementptr inbounds i32, i32* %a, i64 %idxprom:
	; CHECK-NEXT: ((4 * (zext i32 {1,+,1}<%for.body> to i64))<nuw><nsw> + %a)<nsw>			; CHECK-NEXT: ((4 * (zext i32 {1,+,1}<%for.body> to i64))<nuw><nsw> + %a)<nsw>
	; CHECK-NEXT: --> {(4 + %a),+,4}<%for.body>			; CHECK-NEXT: --> {(4 + %a),+,4}<%for.body>
	; CHECK-NEXT: [PSE] %arrayidx4 = getelementptr inbounds i32, i32* %b, i64 %conv11:			; CHECK-NEXT: [PSE] %arrayidx4 = getelementptr inbounds i32, i32* %b, i64 %conv11:

llvm/trunk/test/Analysis/LoopAccessAnalysis/store-to-invariant-check1.ll

	; RUN: opt < %s -loop-accesses -analyze | FileCheck -check-prefix=OLDPM %s			; RUN: opt < %s -loop-accesses -analyze | FileCheck -check-prefix=OLDPM %s
	; RUN: opt -passes='require<scalar-evolution>,require<aa>,loop(print-access-info)' -disable-output < %s 2>&1 | FileCheck -check-prefix=NEWPM %s			; RUN: opt -passes='require<scalar-evolution>,require<aa>,loop(print-access-info)' -disable-output < %s 2>&1 | FileCheck -check-prefix=NEWPM %s

	; Test to confirm LAA will find store to invariant address.			; Test to confirm LAA will find store to invariant address.
	; Inner loop has a store to invariant address.			; Inner loop has a store to invariant address.
	;			;
	; for(; i < itr; i++) {			; for(; i < itr; i++) {
	; for(; j < itr; j++) {			; for(; j < itr; j++) {
	; var1[i] = var2[j] + var1[i];			; var1[i] = var2[j] + var1[i];
	; }			; }
	; }			; }

	; The LAA with the new PM is a loop pass so we go from inner to outer loops.			; The LAA with the new PM is a loop pass so we go from inner to outer loops.

	; OLDPM: for.cond1.preheader:			; OLDPM: for.cond1.preheader:
	; OLDPM: Store to invariant address was not found in loop.			; OLDPM: Variant Store to invariant address was not found in loop.
	; OLDPM: for.body3:			; OLDPM: for.body3:
	; OLDPM: Store to invariant address was found in loop.			; OLDPM: Variant Store to invariant address was found in loop.

	; NEWPM: for.body3:			; NEWPM: for.body3:
	; NEWPM: Store to invariant address was found in loop.			; NEWPM: Variant Store to invariant address was found in loop.
	; NEWPM: for.cond1.preheader:			; NEWPM: for.cond1.preheader:
	; NEWPM: Store to invariant address was not found in loop.			; NEWPM: Variant Store to invariant address was not found in loop.

	define i32 @foo(i32* nocapture %var1, i32* nocapture readonly %var2, i32 %itr) #0 {			define i32 @foo(i32* nocapture %var1, i32* nocapture readonly %var2, i32 %itr) #0 {
	entry:			entry:
	%cmp20 = icmp eq i32 %itr, 0			%cmp20 = icmp eq i32 %itr, 0
	br i1 %cmp20, label %for.end10, label %for.cond1.preheader			br i1 %cmp20, label %for.end10, label %for.cond1.preheader

	for.cond1.preheader: ; preds = %entry, %for.inc8			for.cond1.preheader: ; preds = %entry, %for.inc8
	%indvars.iv23 = phi i64 [ %indvars.iv.next24, %for.inc8 ], [ 0, %entry ]			%indvars.iv23 = phi i64 [ %indvars.iv.next24, %for.inc8 ], [ 0, %entry ]

llvm/trunk/test/Analysis/LoopAccessAnalysis/store-to-invariant-check2.ll

	; RUN: opt < %s -loop-accesses -analyze | FileCheck %s			; RUN: opt < %s -loop-accesses -analyze | FileCheck %s
	; RUN: opt -passes='require<scalar-evolution>,require<aa>,loop(print-access-info)' -disable-output < %s 2>&1 | FileCheck %s			; RUN: opt -passes='require<scalar-evolution>,require<aa>,loop(print-access-info)' -disable-output < %s 2>&1 | FileCheck %s

	; Test to confirm LAA will not find store to invariant address.			; Test to confirm LAA will not find store to invariant address.
	; Inner loop has no store to invariant address.			; Inner loop has no store to invariant address.
	;			;
	; for(; i < itr; i++) {			; for(; i < itr; i++) {
	; for(; j < itr; j++) {			; for(; j < itr; j++) {
	; var2[j] = var2[j] + var1[i];			; var2[j] = var2[j] + var1[i];
	; }			; }
	; }			; }

	; CHECK: Store to invariant address was not found in loop.			; CHECK: Variant Store to invariant address was not found in loop.
	; CHECK-NOT: Store to invariant address was found in loop.			; CHECK-NOT: Variant Store to invariant address was found in loop.


	define i32 @foo(i32* nocapture readonly %var1, i32* nocapture %var2, i32 %itr) #0 {			define i32 @foo(i32* nocapture readonly %var1, i32* nocapture %var2, i32 %itr) #0 {
	entry:			entry:
	%cmp20 = icmp eq i32 %itr, 0			%cmp20 = icmp eq i32 %itr, 0
	br i1 %cmp20, label %for.end10, label %for.cond1.preheader			br i1 %cmp20, label %for.end10, label %for.cond1.preheader

	for.cond1.preheader: ; preds = %entry, %for.inc8			for.cond1.preheader: ; preds = %entry, %for.inc8

llvm/trunk/test/Analysis/LoopAccessAnalysis/store-to-invariant-check3.ll

	; RUN: opt < %s -loop-accesses -analyze | FileCheck %s			; RUN: opt < %s -loop-accesses -analyze | FileCheck %s
	; RUN: opt -passes='require<scalar-evolution>,require<aa>,loop(print-access-info)' -disable-output < %s 2>&1 | FileCheck %s			; RUN: opt -passes='require<scalar-evolution>,require<aa>,loop(print-access-info)' -disable-output < %s 2>&1 | FileCheck %s

	; Test to confirm LAA will find store to invariant address.			; Test to confirm LAA will find store to invariant address.
	; Inner loop has a store to invariant address.			; Inner loop has a store to invariant address.
	;			;
	; for(; i < itr; i++) {			; for(; i < itr; i++) {
	; for(; j < itr; j++) {			; for(; j < itr; j++) {
	; var1[j] = ++var2[i] + var1[j];			; var1[j] = ++var2[i] + var1[j];
	; }			; }
	; }			; }

	; CHECK: Store to invariant address was found in loop.			; CHECK: Variant Store to invariant address was found in loop.

	define void @foo(i32* nocapture %var1, i32* nocapture %var2, i32 %itr) #0 {			define void @foo(i32* nocapture %var1, i32* nocapture %var2, i32 %itr) #0 {
	entry:			entry:
	%cmp20 = icmp sgt i32 %itr, 0			%cmp20 = icmp sgt i32 %itr, 0
	br i1 %cmp20, label %for.cond1.preheader, label %for.end11			br i1 %cmp20, label %for.cond1.preheader, label %for.end11

	for.cond1.preheader: ; preds = %entry, %for.inc9			for.cond1.preheader: ; preds = %entry, %for.inc9
	%indvars.iv23 = phi i64 [ %indvars.iv.next24, %for.inc9 ], [ 0, %entry ]			%indvars.iv23 = phi i64 [ %indvars.iv.next24, %for.inc9 ], [ 0, %entry ]

llvm/trunk/test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -loop-vectorize -S -mcpu=skylake-avx512 -instcombine < %s | FileCheck %s

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				; first test checks that loop with a reduction and a uniform store gets
				; vectorized.
				; CHECK-LABEL: inv_val_store_to_inv_address_with_reduction
				; CHECK-LABEL: vector.memcheck:
				; CHECK: found.conflict

				; CHECK-LABEL: vector.body:
				; CHECK: %vec.phi = phi <16 x i32> [ zeroinitializer, %vector.ph ], [ [[ADD:%[a-zA-Z0-9.]+]], %vector.body ]
				; CHECK: %wide.load = load <16 x i32>
				; CHECK: [[ADD]] = add <16 x i32> %vec.phi, %wide.load
				; CHECK: store i32 %ntrunc, i32* %a
				; CHECK-NOT: store i32 %ntrunc, i32* %a
				; CHECK: %index.next = add i64 %index, 64

				; CHECK-LABEL: middle.block:
				; CHECK: %rdx.shuf = shufflevector <16 x i32>
				define i32 @inv_val_store_to_inv_address_with_reduction(i32* %a, i64 %n, i32* %b) {
				entry:
				%ntrunc = trunc i64 %n to i32
				br label %for.body

				for.body: ; preds = %for.body, %entry
				%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
				%tmp0 = phi i32 [ %tmp3, %for.body ], [ 0, %entry ]
				%tmp1 = getelementptr inbounds i32, i32* %b, i64 %i
				%tmp2 = load i32, i32* %tmp1, align 8
				%tmp3 = add i32 %tmp0, %tmp2
				store i32 %ntrunc, i32* %a
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end: ; preds = %for.body
				%tmp4 = phi i32 [ %tmp3, %for.body ]
				ret i32 %tmp4
				}

				; Conditional store
				; if (b[i] == k) a = ntrunc
				define void @inv_val_store_to_inv_address_conditional(i32* %a, i64 %n, i32* %b, i32 %k) {
				; CHECK-LABEL: @inv_val_store_to_inv_address_conditional(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[NTRUNC:%.]] = trunc i64 [[N:%.]] to i32
				; CHECK-NEXT: [[TMP0:%.*]] = icmp sgt i64 [[N]], 1
				; CHECK-NEXT: [[SMAX:%.*]] = select i1 [[TMP0]], i64 [[N]], i64 1
				; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[SMAX]], 16
				; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.]], label [[VECTOR_MEMCHECK:%.]]
				; CHECK: vector.memcheck:
				; CHECK-NEXT: [[A4:%.]] = bitcast i32 [[A:%.]] to i8
				; CHECK-NEXT: [[B1:%.]] = bitcast i32 [[B:%.]] to i8
				; CHECK-NEXT: [[TMP1:%.*]] = icmp sgt i64 [[N]], 1
				; CHECK-NEXT: [[SMAX2:%.*]] = select i1 [[TMP1]], i64 [[N]], i64 1
				; CHECK-NEXT: [[SCEVGEP:%.]] = getelementptr i32, i32 [[B]], i64 [[SMAX2]]
				; CHECK-NEXT: [[UGLYGEP:%.]] = getelementptr i8, i8 [[A4]], i64 1
				; CHECK-NEXT: [[BOUND0:%.]] = icmp ugt i8 [[UGLYGEP]], [[B1]]
				; CHECK-NEXT: [[BOUND1:%.]] = icmp ugt i32 [[SCEVGEP]], [[A]]
				; CHECK-NEXT: [[FOUND_CONFLICT:%.*]] = and i1 [[BOUND0]], [[BOUND1]]
				; CHECK-NEXT: br i1 [[FOUND_CONFLICT]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[N_VEC:%.*]] = and i64 [[SMAX]], 9223372036854775792
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT5:%.]] = insertelement <16 x i32> undef, i32 [[K:%.]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT6:%.*]] = shufflevector <16 x i32> [[BROADCAST_SPLATINSERT5]], <16 x i32> undef, <16 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT7:%.*]] = insertelement <16 x i32> undef, i32 [[NTRUNC]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT8:%.*]] = shufflevector <16 x i32> [[BROADCAST_SPLATINSERT7]], <16 x i32> undef, <16 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT9:%.]] = insertelement <16 x i32> undef, i32* [[A]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT10:%.]] = shufflevector <16 x i32> [[BROADCAST_SPLATINSERT9]], <16 x i32*> undef, <16 x i32> zeroinitializer
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[INDEX]]
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[TMP2]] to <16 x i32>*
				; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <16 x i32>, <16 x i32>* [[TMP3]], align 8, !alias.scope !8, !noalias !11
				; CHECK-NEXT: [[TMP4:%.*]] = icmp eq <16 x i32> [[WIDE_LOAD]], [[BROADCAST_SPLAT6]]
				; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32* [[TMP2]] to <16 x i32>*
				; CHECK-NEXT: store <16 x i32> [[BROADCAST_SPLAT8]], <16 x i32>* [[TMP5]], align 4, !alias.scope !8, !noalias !11
				; CHECK-NEXT: call void @llvm.masked.scatter.v16i32.v16p0i32(<16 x i32> [[BROADCAST_SPLAT8]], <16 x i32*> [[BROADCAST_SPLAT10]], i32 4, <16 x i1> [[TMP4]]), !alias.scope !11
				; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 16
				; CHECK-NEXT: [[TMP6:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP6]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !13
				; CHECK: middle.block:
				; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[SMAX]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ], [ 0, [[VECTOR_MEMCHECK]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[I:%.*]] = phi i64 [ [[I_NEXT:%.*]], [[LATCH:%.*]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
				; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[I]]
				; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* [[TMP1]], align 8
				; CHECK-NEXT: [[CMP:%.*]] = icmp eq i32 [[TMP2]], [[K]]
				; CHECK-NEXT: store i32 [[NTRUNC]], i32* [[TMP1]], align 4
				; CHECK-NEXT: br i1 [[CMP]], label [[COND_STORE:%.*]], label [[LATCH]]
				; CHECK: cond_store:
				; CHECK-NEXT: store i32 [[NTRUNC]], i32* [[A]], align 4
				; CHECK-NEXT: br label [[LATCH]]
				; CHECK: latch:
				; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i64 [[I]], 1
				; CHECK-NEXT: [[COND:%.*]] = icmp slt i64 [[I_NEXT]], [[N]]
				; CHECK-NEXT: br i1 [[COND]], label [[FOR_BODY]], label [[FOR_END]], !llvm.loop !14
				; CHECK: for.end:
				; CHECK-NEXT: ret void
				;
				entry:
				%ntrunc = trunc i64 %n to i32
				br label %for.body

				for.body: ; preds = %for.body, %entry
				%i = phi i64 [ %i.next, %latch ], [ 0, %entry ]
				%tmp1 = getelementptr inbounds i32, i32* %b, i64 %i
				%tmp2 = load i32, i32* %tmp1, align 8
				%cmp = icmp eq i32 %tmp2, %k
				store i32 %ntrunc, i32* %tmp1
				br i1 %cmp, label %cond_store, label %latch

				cond_store:
				store i32 %ntrunc, i32* %a
				br label %latch

				latch:
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end: ; preds = %for.body
				ret void
				}
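
The vector.memcheck CHECK lines in the test above amount to an interval-overlap test between the range accessed through %b and the single location written through %a. The following is a minimal C++ sketch that mirrors those lines one by one. The function name foundConflict and the use of integer addresses are illustrative assumptions, not LLVM code, and a 4-byte i32 is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of LLVM): mirrors the CHECK lines of the
// vector.memcheck block. Addresses are passed as unsigned integers so the
// comparisons behave like the unsigned "icmp ugt" in the IR.
bool foundConflict(std::uintptr_t a, std::uintptr_t b, std::int64_t n) {
  // SMAX2 = select (n > 1), n, 1
  std::int64_t smax = std::max<std::int64_t>(n, 1);
  // SCEVGEP = &b[smax], assuming sizeof(i32) == 4
  std::uintptr_t scevgep = b + 4 * static_cast<std::uintptr_t>(smax);
  // UGLYGEP = (i8*)a + 1
  std::uintptr_t uglygep = a + 1;
  bool bound0 = uglygep > b;   // BOUND0
  bool bound1 = scevgep > a;   // BOUND1
  return bound0 && bound1;     // FOUND_CONFLICT
}
```

With b at address 1000 and n = 8, an a that points at b[7] (address 1028) reports a conflict, so the scalar loop runs; an a at address 1032 or beyond does not.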

llvm/trunk/test/Transforms/LoopVectorize/invariant-store-vectorization.ll

				; RUN: opt < %s -licm -loop-vectorize -force-vector-width=4 -dce -instcombine -licm -S | FileCheck %s

				; First licm pass is to hoist/sink invariant stores if possible. Today LICM does
				; not hoist/sink the invariant stores. Even if that changes, we should still
				; vectorize this loop in case licm is not run.

				; The next licm pass after vectorization is to hoist/sink loop invariant
				; instructions.
				target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"

				; all tests check that it is legal to vectorize the stores to invariant
				; address.


				; CHECK-LABEL: inv_val_store_to_inv_address_with_reduction(
				; memory check is found.conflict = b[max(n-1,1)] > a && (i8* a)+1 > (i8* b)
				; CHECK: vector.memcheck:
				; CHECK: found.conflict

				; CHECK-LABEL: vector.body:
				; CHECK: %vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ [[ADD:%[a-zA-Z0-9.]+]], %vector.body ]
				; CHECK: %wide.load = load <4 x i32>
				; CHECK: [[ADD]] = add <4 x i32> %vec.phi, %wide.load
				; CHECK-NEXT: store i32 %ntrunc, i32* %a
				; CHECK-NEXT: %index.next = add i64 %index, 4
				; CHECK-NEXT: icmp eq i64 %index.next, %n.vec
				; CHECK-NEXT: br i1

				; CHECK-LABEL: middle.block:
				; CHECK: %rdx.shuf = shufflevector <4 x i32>
				define i32 @inv_val_store_to_inv_address_with_reduction(i32* %a, i64 %n, i32* %b) {
				entry:
				%ntrunc = trunc i64 %n to i32
				br label %for.body

				for.body: ; preds = %for.body, %entry
				%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
				%tmp0 = phi i32 [ %tmp3, %for.body ], [ 0, %entry ]
				%tmp1 = getelementptr inbounds i32, i32* %b, i64 %i
				%tmp2 = load i32, i32* %tmp1, align 8
				%tmp3 = add i32 %tmp0, %tmp2
				store i32 %ntrunc, i32* %a
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end: ; preds = %for.body
				%tmp4 = phi i32 [ %tmp3, %for.body ]
				ret i32 %tmp4
				}

				; CHECK-LABEL: inv_val_store_to_inv_address(
				; CHECK-LABEL: vector.body:
				; CHECK: store i32 %ntrunc, i32* %a
				; CHECK: store <4 x i32>
				; CHECK-NEXT: %index.next = add i64 %index, 4
				; CHECK-NEXT: icmp eq i64 %index.next, %n.vec
				; CHECK-NEXT: br i1
				define void @inv_val_store_to_inv_address(i32* %a, i64 %n, i32* %b) {
				entry:
				%ntrunc = trunc i64 %n to i32
				br label %for.body

				for.body: ; preds = %for.body, %entry
				%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
				%tmp1 = getelementptr inbounds i32, i32* %b, i64 %i
				%tmp2 = load i32, i32* %tmp1, align 8
				store i32 %ntrunc, i32* %a
				store i32 %ntrunc, i32* %tmp1
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end: ; preds = %for.body
				ret void
				}


				; Both of these tests below are handled as predicated stores.

				; Conditional store
				; if (b[i] == k) a = ntrunc
				; TODO: We can be better with the code gen for the first test and we can have
				; just one scalar store if vector.or.reduce(vector_cmp(b[i] == k)) is 1.

				; CHECK-LABEL:inv_val_store_to_inv_address_conditional(
				; CHECK-LABEL: vector.body:
				; CHECK: %wide.load = load <4 x i32>, <4 x i32>*
				; CHECK: [[CMP:%[a-zA-Z0-9.]+]] = icmp eq <4 x i32> %wide.load, %{{.*}}
				; CHECK: store <4 x i32>
				; CHECK-NEXT: [[EE:%[a-zA-Z0-9.]+]] = extractelement <4 x i1> [[CMP]], i32 0
				; CHECK-NEXT: br i1 [[EE]], label %pred.store.if, label %pred.store.continue

				; CHECK-LABEL: pred.store.if:
				; CHECK-NEXT: store i32 %ntrunc, i32* %a
				; CHECK-NEXT: br label %pred.store.continue

				; CHECK-LABEL: pred.store.continue:
				; CHECK-NEXT: [[EE1:%[a-zA-Z0-9.]+]] = extractelement <4 x i1> [[CMP]], i32 1
				define void @inv_val_store_to_inv_address_conditional(i32* %a, i64 %n, i32* %b, i32 %k) {
				entry:
				%ntrunc = trunc i64 %n to i32
				br label %for.body

				for.body: ; preds = %for.body, %entry
				%i = phi i64 [ %i.next, %latch ], [ 0, %entry ]
				%tmp1 = getelementptr inbounds i32, i32* %b, i64 %i
				%tmp2 = load i32, i32* %tmp1, align 8
				%cmp = icmp eq i32 %tmp2, %k
				store i32 %ntrunc, i32* %tmp1
				br i1 %cmp, label %cond_store, label %latch

				cond_store:
				store i32 %ntrunc, i32* %a
				br label %latch

				latch:
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end: ; preds = %for.body
				ret void
				}

				; if (b[i] == k)
				; a = ntrunc
				; else a = k;
				; TODO: We could vectorize this once we support multiple uniform stores to the
				; same address.
				; CHECK-LABEL:inv_val_store_to_inv_address_conditional_diff_values(
				; CHECK-NOT: load <4 x i32>
				define void @inv_val_store_to_inv_address_conditional_diff_values(i32* %a, i64 %n, i32* %b, i32 %k) {
				entry:
				%ntrunc = trunc i64 %n to i32
				br label %for.body

				for.body: ; preds = %for.body, %entry
				%i = phi i64 [ %i.next, %latch ], [ 0, %entry ]
				%tmp1 = getelementptr inbounds i32, i32* %b, i64 %i
				%tmp2 = load i32, i32* %tmp1, align 8
				%cmp = icmp eq i32 %tmp2, %k
				store i32 %ntrunc, i32* %tmp1
				br i1 %cmp, label %cond_store, label %cond_store_k

				cond_store:
				store i32 %ntrunc, i32* %a
				br label %latch

				cond_store_k:
				store i32 %k, i32 * %a
				br label %latch

				latch:
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end: ; preds = %for.body
				ret void
				}

				; Instcombine'd version of above test. Now the store is no longer of invariant
				; value.
				; TODO: We should be able to vectorize this loop once we support vectorizing
				; stores of variant values to invariant addresses.
				; CHECK-LABEL: inv_val_store_to_inv_address_conditional_diff_values_ic
				; CHECK-NOT: <4 x
				define void @inv_val_store_to_inv_address_conditional_diff_values_ic(i32* %a, i64 %n, i32* %b, i32 %k) {
				entry:
				%ntrunc = trunc i64 %n to i32
				br label %for.body

				for.body: ; preds = %for.body, %entry
				%i = phi i64 [ %i.next, %latch ], [ 0, %entry ]
				%tmp1 = getelementptr inbounds i32, i32* %b, i64 %i
				%tmp2 = load i32, i32* %tmp1, align 8
				%cmp = icmp eq i32 %tmp2, %k
				store i32 %ntrunc, i32* %tmp1
				br i1 %cmp, label %cond_store, label %cond_store_k

				cond_store:
				br label %latch

				cond_store_k:
				br label %latch

				latch:
				%storeval = phi i32 [ %ntrunc, %cond_store ], [ %k, %cond_store_k ]
				store i32 %storeval, i32* %a
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end: ; preds = %for.body
				ret void
				}

				; invariant val stored to invariant address predicated on invariant condition
				; This is not treated as a predicated store since the block the store belongs to
				; is the latch block (which doesn't need to be predicated).
				; TODO: We should vectorize this loop once we relax the check for
				; variant/invariant values being stored to invariant address.
				; CHECK-LABEL: inv_val_store_to_inv_address_conditional_inv
				; CHECK-NOT: <4 x
				define void @inv_val_store_to_inv_address_conditional_inv(i32* %a, i64 %n, i32* %b, i32 %k) {
				entry:
				%ntrunc = trunc i64 %n to i32
				%cmp = icmp eq i32 %ntrunc, %k
				br label %for.body

				for.body: ; preds = %for.body, %entry
				%i = phi i64 [ %i.next, %latch ], [ 0, %entry ]
				%tmp1 = getelementptr inbounds i32, i32* %b, i64 %i
				%tmp2 = load i32, i32* %tmp1, align 8
				store i32 %ntrunc, i32* %tmp1
				br i1 %cmp, label %cond_store, label %cond_store_k

				cond_store:
				br label %latch

				cond_store_k:
				br label %latch

				latch:
				%storeval = phi i32 [ %ntrunc, %cond_store ], [ %k, %cond_store_k ]
				store i32 %storeval, i32* %a
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end: ; preds = %for.body
				ret void
				}

				; TODO: This loop can be vectorized once we support variant value being
				; stored into invariant address.
				; CHECK-LABEL: variant_val_store_to_inv_address
				; CHECK-NOT: <4 x i32>
				define i32 @variant_val_store_to_inv_address(i32* %a, i64 %n, i32* %b, i32 %k) {
				entry:
				%ntrunc = trunc i64 %n to i32
				%cmp = icmp eq i32 %ntrunc, %k
				br label %for.body

				for.body: ; preds = %for.body, %entry
				%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
				%tmp0 = phi i32 [ %tmp3, %for.body ], [ 0, %entry ]
				%tmp1 = getelementptr inbounds i32, i32* %b, i64 %i
				%tmp2 = load i32, i32* %tmp1, align 8
				store i32 %tmp2, i32* %a
				%tmp3 = add i32 %tmp0, %tmp2
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end: ; preds = %for.body
				%rdx.lcssa = phi i32 [ %tmp0, %for.body ]
				ret i32 %rdx.lcssa
				}
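
The TODO on inv_val_store_to_inv_address_conditional in the test file above suggests replacing the per-lane predicated stores with one scalar store guarded by an OR-reduction of the compare mask. Below is a minimal C++ sketch of that idea, not what this patch emits. It assumes that a does not alias any element of b (the property vector.memcheck establishes at run time) and that n >= 1. The function names are hypothetical, and VF = 4 is chosen to match -force-vector-width=4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reference for the loop body of the test:
//   t = b[i]; b[i] = ntrunc; if (t == k) *a = ntrunc;
void condStoreScalar(std::int32_t *a, std::size_t n, std::int32_t *b,
                     std::int32_t k, std::int32_t ntrunc) {
  for (std::size_t i = 0; i < n; ++i) {
    bool hit = (b[i] == k);
    b[i] = ntrunc;
    if (hit)
      *a = ntrunc;
  }
}

// Chunked variant: at most one scalar store to *a per group of VF iterations.
// Only valid when a does not point into b[0..n).
void condStoreChunked(std::int32_t *a, std::size_t n, std::int32_t *b,
                      std::int32_t k, std::int32_t ntrunc) {
  constexpr std::size_t VF = 4;
  std::size_t i = 0;
  for (; i + VF <= n; i += VF) {
    bool anyLane = false;                  // OR-reduction of the compare mask
    for (std::size_t l = 0; l < VF; ++l)
      anyLane |= (b[i + l] == k);
    for (std::size_t l = 0; l < VF; ++l)   // models the wide store to b
      b[i + l] = ntrunc;
    if (anyLane)
      *a = ntrunc;                         // single invariant scalar store
  }
  for (; i < n; ++i) {                     // scalar remainder iterations
    bool hit = (b[i] == k);
    b[i] = ntrunc;
    if (hit)
      *a = ntrunc;
  }
}
```

Because the stored value and the address are both loop invariant, storing once when any lane matches leaves memory in the same state as storing once per matching lane.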

llvm/trunk/test/Transforms/LoopVectorize/pr31190.ll

				; Note that we can no longer get the vectorizer to actually see such PHIs,
				; because LV now simplifies the loop internally, but the test is still
				; useful as a regression test, and in case loop-simplify behavior changes.

				@c = external global i32, align 4
				@a = external global i32, align 4
				@b = external global [1 x i32], align 4

			-	; CHECK: LV: Not vectorizing: Cannot prove legality.
			+	; We can vectorize this loop because we are storing an invariant value into an
			+	; invariant address.
			+
			+	; CHECK: LV: We can vectorize this loop!
				; CHECK-LABEL: @test
				define void @test() {
				entry:
				%a.promoted2 = load i32, i32* @a, align 1
				%c.promoted = load i32, i32* @c, align 1
				br label %for.cond1.preheader

				for.cond1.preheader: ; preds = %for.cond1.for.inc4_crit_edge, %entry


Revision Contents

Diff 166987
