While working on D50665, I came across a latent bug in the vectorizer, which generates
incorrect code for uniform memory accesses that are executed conditionally.
This affects architectures that have masked gather/scatter support.
See the added test case for X86. Without this patch, we were unconditionally
executing the load in the vectorized version. This can introduce a SEGFAULT
that never occurs in the scalar version.
The fix here is to avoid scalarizing uniform loads that are executed
conditionally. On architectures with masked gather support, these loads
should use the masked gather instruction.
"non-zero" >> "non one": VF is always non-zero. VF=1 usually stands for the original scalar instructions (e.g., possibly unrolled by UF). Here, iiuc, VF>1 corresponds to asking willBeScalarWithPredication() - after per-VF decisions have been taken, whereas VF=1 corresponds to asking mustBeScalarWithPredication() - before per-VF decisions have been taken; as called by the assert of useEmulatedMaskMemRefHack() in one context. But tryToWiden() and handleReplication() may also pass VF=1, and they call after decisions have been made.
Best to always use isScalarWithPredication() after per-VF decisions have been made. In any case, better to have each caller specify the VF explicitly, to make sure it's passed whenever available; e.g., see below.
"Such instructions include" also conditional loads, with this patch.