This is an archive of the discontinued LLVM Phabricator instance.


Differential D115329

[LoopVectorize] Pass a vector type to isLegalMaskedGather/Scatter
ClosedPublic

Authored by RosieSumpter on Dec 8 2021, 4:15 AM.


Details

Reviewers

sdesmalen
david-arm
kmclaughlin
dmgreen
RKSimon
lebedev.ri
fhahn

Commits

rG552eb372cb81: [LoopVectorize] Pass a vector type to isLegalMaskedGather/Scatter

Summary

This is required to query the legality more precisely in the LoopVectorizer.

This adds another TTI function, 'forceScalarizeMaskedGather/Scatter',
to work around the hack introduced for MVE, where
isLegalMaskedGather/Scatter would return an answer by second-guessing
where the function was called from, based on the Type passed in (vector
vs scalar). The new interface makes this explicit. It is also used by
X86 to check for vector widths where gather/scatters aren't profitable
(or don't exist) for certain subtargets.
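
As a rough illustration of the split described above, here is a minimal, self-contained sketch. It does not use the real LLVM classes: `ToyVectorType`, `ToyMVETarget` and `keepGatherIntrinsic` are hypothetical names invented for this example, and alignment is modelled as a plain byte count. The element-width and alignment rule mirrors the ARM check visible in the diff; the way the two hooks are combined is an assumption about how a generic scalarization pass might use them.

```cpp
#include <cassert>

// Hypothetical stand-in for a fixed-width vector type <NumElts x iEltBits>.
struct ToyVectorType {
  unsigned EltBits;
  unsigned NumElts;
};

// Toy model of an MVE-like target with the two hooks kept separate.
struct ToyMVETarget {
  // Legality answers only "can the hardware do this for the given type?".
  // It no longer inspects whether the caller passed a scalar or a vector.
  bool isLegalMaskedGather(ToyVectorType Ty, unsigned AlignBytes) const {
    assert(AlignBytes != 0 && "alignment must be non-zero");
    return (Ty.EltBits == 32 && AlignBytes >= 4) ||
           (Ty.EltBits == 16 && AlignBytes >= 2) || Ty.EltBits == 8;
  }

  // Explicit hook: any gather that reaches the generic scalarization pass
  // was not picked up by the target's own lowering, so expand it.
  bool forceScalarizeMaskedGather(ToyVectorType, unsigned) const {
    return true;
  }
};

// Assumed combination of the two hooks in a generic scalarization pass:
// keep the intrinsic only if it is legal and the target does not force
// scalarization.
template <typename Target>
bool keepGatherIntrinsic(const Target &T, ToyVectorType Ty,
                         unsigned AlignBytes) {
  return T.isLegalMaskedGather(Ty, AlignBytes) &&
         !T.forceScalarizeMaskedGather(Ty, AlignBytes);
}
```

With this shape the vectorizer can ask the legality question with a real vector type and get a meaningful answer, while the decision to expand leftover intrinsics is stated by a separate hook instead of being inferred from the type that was passed in.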

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

RosieSumpter created this revision. Dec 8 2021, 4:15 AM

Herald added subscribers: pengfei, arphaman, hiraditya. Dec 8 2021, 4:15 AM

RosieSumpter requested review of this revision. Dec 8 2021, 4:15 AM

Herald added a project: Restricted Project. Dec 8 2021, 4:15 AM

Herald added a subscriber: llvm-commits.

david-arm added inline comments. Dec 8 2021, 4:51 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1556	According to the default `VF = ElementCount::getFixed(1)` it looks like we can still have an inconsistency during vectorisation where we get different answers? I think it's also called from `collectElementTypesForWidening` so we could potentially say a gather is legal when the type is scalar, but then say it's illegal in setCostBasedWideningDecision. I wonder if it's worth adding a comment there saying something like:

// We're simply querying at this point if the target even supports any vector
// gathers and scatters for the given element type? Certain vector forms may
// still be illegal, but setCostBasedWideningDecision will distinguish between
// the legal behaviour for different VFs at that point and generate costs accordingly.

Also, does it matter that `LoopVectorizationCostModel::isScalarWithPredication` is still passing in an element type rather than a vector? For example, `collectLoopUniforms` calls `isScalarWithPredication` and at that point knows the VF. It may be fine, but I'm just a bit worried we might be swapping one inconsistency for another that's all.

Harbormaster completed remote builds in B138117: Diff 392706. Dec 8 2021, 5:05 AM

sdesmalen added inline comments. Dec 8 2021, 6:12 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1556	The code in `collectElementTypesForWidening` seems like a hack to get the LV to choose a wider VF and it seems specific to X86. If I replace that whole expression with `if (!T->isSized()) continue`, I only get a single test that fails because it explicitly tests for it to use a wider VF. I doubt this is the right mechanism to use for that kind of purpose going forward.

In any case, if `collectElementTypesForWidening` would consider using a wider VF, it's still the cost-model that will then narrow that down (after this patch), because passing in the VF will result in a more restricted answer for `isLegalMaskedGather`.

Quoted from the comment above: "Also, does it matter that LoopVectorizationCostModel::isScalarWithPredication is still passing in an element type rather than a vector? For example, collectLoopUniforms calls isScalarWithPredication and at that point knows the VF. It may be fine, but I'm just a bit worried we might be swapping one inconsistency for another that's all."

I don't know if there's an actual problem atm, but I agree it makes sense to pass it to isScalarWithPredication as well. It should be possible since VF is available in all functions that call it.

MVE part sounds OK to me. Is the X86 part expected to be the same or better?

lebedev.ri added inline comments. Dec 9 2021, 1:33 AM

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
5162–5163	Can you move this code into `forceScalarizeMaskedGather()`? This should not have been here, LV should rely on cost model.

Moved code which checks if masked gathers/scatters are legal for the given vector width to forceScalarizeMaskedGather/Scatter for X86
Pass VF to isScalarWithPredication so that we are consistently passing a vector type to isLegalMaskedGather/Scatter
Add minimum vscale attribute to the necessary AArch64 tests
Remove changes to X86 tests (X86 behaviour is now unchanged by this patch)

RosieSumpter marked an inline comment as done. Dec 10 2021, 3:27 AM

Harbormaster completed remote builds in B138626: Diff 393424. Dec 10 2021, 4:22 AM

@lebedev.ri/@fhahn given that there exists a pass that scalarizes the masked load/store/gather/scatter intrinsics, is there a reason why we wouldn't always want to choose the intrinsic representation over scalarizing it in the LV? It seems a bit odd to want to trick the LV to think that a masked gather is legal, but then still scalarize it in the end. This means it is not using the right cost-model, unless it reimplements the scalarization costs, like is done by X86TargetTransformInfo::getGSScalarCost.

llvm/lib/Target/ARM/ARMTargetTransformInfo.h
194	nit: s/can to/can lower to/
195–196	nit: s/, so if we are here, we know we want to expand//
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
5155	This is casting to `FixedVectorType`, so I think we should make the interface take `VectorType` instead of `Type`.

In D115329#3188834, @sdesmalen wrote:

@lebedev.ri/@fhahn given that there exists a pass that scalarizes the masked load/store/gather/scatter intrinsics,
is there a reason why we wouldn't always want to choose the intrinsic representation over scalarizing it in the LV?
It seems a bit odd to want to trick the LV to think that a masked gather is legal, but then still scalarize it in the end.
This means it is not using the right cost-model, unless it reimplements the scalarization costs,
like is done by X86TargetTransformInfo::getGSScalarCost.

The fact that something is/isn't legal does not mean that LV will pick *that* particular path.
It merely allows LV to consider this path, among the others,
and out of all the legal paths, pick the most profitable one.

We don't "trick LV to think it's legal and then scalarize anyway", it *is* legal.
But for those particular VF's, it's more costly than the native scalarized path.
So LV won't actually produce gather intrinsics, and naturally, they won't be scalarized.

Furthermore, no, while we could just always prefer the intrinsics (from correctness viewpoint),
that is detrimental to the performance. See e.g. my recent patches, namely D111220 / D111363 / D111546.

So no, what happens is correct.
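
The distinction drawn above, that legality only admits an option and the cost model then picks among the admitted options, can be sketched with a small self-contained example. The names `Widening`, `ToyCosts` and `chooseWidening` are hypothetical and the cost numbers used with it are made up; this is not the LoopVectorize implementation.

```cpp
#include <cassert>

// The two ways a non-consecutive memory access could be widened in this toy.
enum class Widening { Scalarize, GatherScatter };

// Hypothetical per-VF costs as a cost model might report them.
struct ToyCosts {
  unsigned ScalarizeCost;
  unsigned GatherCost;
};

// Legality only decides whether the gather option is on the table.
// Among the options that are on the table, the cheaper one wins.
Widening chooseWidening(bool GatherIsLegal, ToyCosts C) {
  if (!GatherIsLegal)
    return Widening::Scalarize;
  return C.GatherCost < C.ScalarizeCost ? Widening::GatherScatter
                                        : Widening::Scalarize;
}
```

In this model a legal gather at a narrow VF can still lose to scalarization purely on cost, in which case no gather intrinsic is emitted at all.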

In D115329#3188963, @lebedev.ri wrote:

The fact that something is/isn't legal does not mean that LV will pick *that* particular path.
It merely allows LV to consider this path, among the others,
and out of all the legal paths, pick the most profitable one.

The LV makes the decision to use gathers/scatters because the cost-model said the gather/scatter cost was lower than the scalarization cost, but that only works if the gather/scatter cost is correct. If it will be scalarized, then I'd expect this cost to be no different than the scalarization cost.

We don't "trick LV to think it's legal and then scalarize anyway", it *is* legal.

I'm not sure what the meaning of "legal" is in that case. My understanding of the term is that it means that the target can handle the operation directly without legalization, e.g. using a special instruction or a sequence of instructions. From that perspective, saying that <operation> is legal, means it will not need to be scalarized. (conversely, the ScalarizeMaskedOperations pass only scalarizes the intrinsic if it is *not* marked as legal). If the LV asks the question "is a masked gather operation legal?" and it returns true, it seems wrong to me to query the cost as if it were a gather rather than asking for the scalarization cost, but then later still decide to scalarize the operation.

But for those particular VF's, it's more costly than the native scalarized path.
So LV won't actually produce gather intrinsics, and naturally, they won't be scalarized.

Furthermore, no, while we could just always prefer the intrinsics (from correctness viewpoint),
that is detrimental to the performance. See e.g. my recent patches, namely D111220 / D111363 / D111546.

This is the answer I was after, thanks for pointing me to those patches!

Change data type parameter for forceScalarizeMaskedGather/Scatter to be a VectorType instead of a Type
Address nits

In D115329#3189215, @sdesmalen wrote:

In D115329#3188963, @lebedev.ri wrote:

The fact that something is/isn't legal does not mean that LV will pick *that* particular path.
It merely allows LV to consider this path, among the others,
and out of all the legal paths, pick the most profitable one.

The LV makes the decision to use gathers/scatters because the cost-model said the gather/scatter cost was lower than the scalarization cost, but that only works if the gather/scatter cost is correct. If it will be scalarized, then I'd expect this cost to be no different than the scalarization cost.

If the cost model is wrong then lots of weird shit will go wrong, yes.

We don't "trick LV to think it's legal and then scalarize anyway", it *is* legal.

I'm not sure what the meaning of "legal" is in that case.

That the gather with VF=2 can be done by the same instruction as for the VF=4, but after padding the mask with zeros.

My understanding of the term is that it means that the target can handle the operation directly without legalization, e.g. using a special instruction or a sequence of instructions.

From that perspective, saying that <operation> is legal, means it will not need to be scalarized.

Define "need".
It is not much different from what VectorCombine does ("would scalar cost be lower than vector cost?
if so, scalarize" or the other way around), just implemented in another pass.

(conversely, the ScalarizeMaskedOperations pass only scalarizes the intrinsic if it is *not* marked as legal). If the LV asks the question "is a masked gather operation legal?" and it returns true, it seems wrong to me to query the cost as if it were a gather rather than asking for the scalarization cost, but then later still decide to scalarize the operation.

But for those particular VF's, it's more costly than the native scalarized path.
So LV won't actually produce gather intrinsics, and naturally, they won't be scalarized.

Furthermore, no, while we could just always prefer the intrinsics (from correctness viewpoint),
that is detrimental to the performance. See e.g. my recent patches, namely D111220 / D111363 / D111546.

This is the answer I was after, thanks for pointing me to those patches!

Cheers.
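
The remark above that a VF=2 gather "can be done by the same instruction as for the VF=4, but after padding the mask with zeros" can be illustrated in plain C++. This is a toy model of masked-gather semantics on arrays, not target code; `gather4` and `gather2ViaGather4` are hypothetical helper names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy 4-lane masked gather: lanes whose mask bit is false keep the
// pass-through value and never read memory.
std::array<int, 4> gather4(const int *Base, std::array<std::size_t, 4> Idx,
                           std::array<bool, 4> Mask,
                           std::array<int, 4> PassThru) {
  assert(Base && "base pointer must be valid");
  std::array<int, 4> Result = PassThru;
  for (std::size_t I = 0; I < 4; ++I)
    if (Mask[I])
      Result[I] = Base[Idx[I]];
  return Result;
}

// A 2-lane gather expressed through the 4-lane one. The two extra lanes get
// a false mask, so their (dummy) indices are never dereferenced.
std::array<int, 2> gather2ViaGather4(const int *Base,
                                     std::array<std::size_t, 2> Idx,
                                     std::array<bool, 2> Mask,
                                     std::array<int, 2> PassThru) {
  std::array<int, 4> Wide =
      gather4(Base, {Idx[0], Idx[1], 0, 0}, {Mask[0], Mask[1], false, false},
              {PassThru[0], PassThru[1], 0, 0});
  return {Wide[0], Wide[1]};
}
```

The padded lanes cost nothing semantically, which is why the narrow gather counts as legal here; whether it is also the cheapest option is a separate question for the cost model.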

RosieSumpter marked 3 inline comments as done. Dec 13 2021, 8:34 AM

Harbormaster completed remote builds in B138981: Diff 393907. Dec 13 2021, 9:16 AM

In D115329#3189241, @lebedev.ri wrote:

In D115329#3189215, @sdesmalen wrote:

From that perspective, saying that <operation> is legal, means it will not need to be scalarized.

Define "need".
It is not much different from what VectorCombine does ("would scalar cost be lower than vector cost?
if so, scalarize" or the other way around), just implemented in another pass.

I see your point; it just seemed odd to me that for X86 this required reimplementing a different scalarisation cost rather than relying on the mechanisms already present in LoopVectorize, which tries to cost as accurately as possible whether to scalarize or use the intrinsics.
To me that suggests that the scalarization cost in LoopVectorize is inaccurate, or that there are improvements to be made to the code for scalarizing in the LoopVectorizer (or both), to make sure there is no disparity in performance.

As you explained, it's currently doing the right thing in the sense that it leads to the desired result.

sdesmalen added inline comments. Dec 14 2021, 6:05 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3377 ↗	(On Diff #393907)	Hi @RosieSumpter this check should probably not be here anymore. Only isLegalMaskedGather should be used to determine whether or not to use an intrinsic for the gather operation.

Removed forceScalarizeMaskedGather check from SLPVectorizer.cpp (since we don't check it in LoopVectorize.cpp)
Updated necessary SLPVectorizer/X86 tests

RosieSumpter marked an inline comment as done. Dec 15 2021, 3:32 AM

Harbormaster completed remote builds in B139400: Diff 394511. Dec 15 2021, 4:10 AM

Hi @lebedev.ri, the SLPVectorizer/X86/pr47629 tests have changed because isLegalMaskedGather now returns true for certain cases where it didn't before (due to the check on the number of vector elements now being in forceScalarizeMaskedGather as requested). It then calculates the gather/scatter cost, and because forceScalarizeMaskedGather returns true, it calculates the cost using getGSScalarCost. This cost is higher than before and so it chooses not to vectorize. An alternative approach would be to check forceScalarizeMaskedGather in isLegalMaskedGather instead of when calculating the cost, but this will then mean LoopVectorize will assume these operations need to be scalarized so will cause test failures there. What do you think the preferred option is here?

Hi @RosieSumpter, just wanted to let you know I'm happy with the patch. This looks like a good improvement to me. It conceptually makes more sense and it also allows AArch64TTI to better distinguish between Neon/SVE when using fixed-width vectors.
I want to accept the patch, but I'll defer this to @lebedev.ri because there's still an open question about the X86 cost-model.

Friendly ping @lebedev.ri, do you have any further comments for this patch?

No concerns from my side, I guess this is one of the reasonable solutions.

sdesmalen accepted this revision. Jan 4 2022, 2:50 AM

This revision is now accepted and ready to land. Jan 4 2022, 2:50 AM

Are the X86 behavior changes intentional?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1570–1573	documentation needs to be updated.
1579–1580	VF should be documented.

Updated descriptions of isScalarWithPredication and isPredicatedInst
rebased

In D115329#3219230, @fhahn wrote:

Are the X86 behavior changes intentional?

Hi @fhahn, yes these changes are intentional - the cost model for X86 is now more accurate, and in some cases it ends up with a higher cost through getGSScalarCost because it knows the cost of the gather/scatter will be expanded later.
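
A minimal sketch of the cost selection described above, under stated assumptions: `ToyCosts`, `toyForceScalarizeMaskedGather` and `toyGatherCost` are hypothetical names, the "two elements or fewer" rule is an invented stand-in for a subtarget-specific check, and the cost numbers used with it are made up. It only models the shape of the decision: a legal gather that the target will expand anyway is costed as scalarized code.

```cpp
#include <cassert>

// Hypothetical costs for one gather at a given vector width.
struct ToyCosts {
  unsigned Scalarized; // cost if the operation is expanded into scalar code
  unsigned Vector;     // cost if a real gather instruction is used
};

// Invented subtarget rule for illustration: narrow gathers exist but will be
// expanded later, so they must not be costed as real gathers.
bool toyForceScalarizeMaskedGather(unsigned NumElts) { return NumElts <= 2; }

// Pick the cost that matches what will actually be emitted.
unsigned toyGatherCost(bool IsLegal, unsigned NumElts, ToyCosts C) {
  assert(NumElts > 0 && "vector must have at least one element");
  if (!IsLegal || toyForceScalarizeMaskedGather(NumElts))
    return C.Scalarized;
  return C.Vector;
}
```

Under this model a caller such as the SLP vectorizer sees the higher scalarized cost for the narrow widths and may decide not to vectorize, which is consistent with the test changes discussed in this thread.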

Harbormaster completed remote builds in B142469: Diff 398676. Jan 10 2022, 10:22 AM

Closed by commit rG552eb372cb81: [LoopVectorize] Pass a vector type to isLegalMaskedGather/Scatter (authored by RosieSumpter). Jan 12 2022, 5:38 AM

This revision was automatically updated to reflect the committed changes.

RosieSumpter added a commit: rG552eb372cb81: [LoopVectorize] Pass a vector type to isLegalMaskedGather/Scatter.

Revision Contents

Path	Size
llvm/include/llvm/Analysis/TargetTransformInfo.h	18 lines
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h	9 lines
llvm/lib/Analysis/TargetTransformInfo.cpp	10 lines
llvm/lib/Target/ARM/ARMTargetTransformInfo.h	12 lines
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp	12 lines
llvm/lib/Target/X86/X86TargetTransformInfo.h	4 lines
llvm/lib/Target/X86/X86TargetTransformInfo.cpp	54 lines
llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp	7 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	71 lines
llvm/test/Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll	2 lines
llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse-mask4.ll	2 lines
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll	10 lines
llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll	130 lines
llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll	130 lines

Diff 399299

llvm/include/llvm/Analysis/TargetTransformInfo.h

(657 lines hidden) public:
   bool isLegalNTStore(Type *DataType, Align Alignment) const;
   /// Return true if the target supports nontemporal load.
   bool isLegalNTLoad(Type *DataType, Align Alignment) const;

   /// Return true if the target supports masked scatter.
   bool isLegalMaskedScatter(Type *DataType, Align Alignment) const;
   /// Return true if the target supports masked gather.
   bool isLegalMaskedGather(Type *DataType, Align Alignment) const;
+  /// Return true if the target forces scalarizing of llvm.masked.gather
+  /// intrinsics.
+  bool forceScalarizeMaskedGather(VectorType *Type, Align Alignment) const;
+  /// Return true if the target forces scalarizing of llvm.masked.scatter
+  /// intrinsics.
+  bool forceScalarizeMaskedScatter(VectorType *Type, Align Alignment) const;

   /// Return true if the target supports masked compress store.
   bool isLegalMaskedCompressStore(Type *DataType) const;
   /// Return true if the target supports masked expand load.
   bool isLegalMaskedExpandLoad(Type *DataType) const;

   /// Return true if we should be enabling ordered reductions for the target.
   bool enableOrderedReductions() const;
(865 lines hidden) public:
   virtual AddressingModeKind
   getPreferredAddressingMode(const Loop *L, ScalarEvolution *SE) const = 0;
   virtual bool isLegalMaskedStore(Type *DataType, Align Alignment) = 0;
   virtual bool isLegalMaskedLoad(Type *DataType, Align Alignment) = 0;
   virtual bool isLegalNTStore(Type *DataType, Align Alignment) = 0;
   virtual bool isLegalNTLoad(Type *DataType, Align Alignment) = 0;
   virtual bool isLegalMaskedScatter(Type *DataType, Align Alignment) = 0;
   virtual bool isLegalMaskedGather(Type *DataType, Align Alignment) = 0;
+  virtual bool forceScalarizeMaskedGather(VectorType *DataType,
+                                          Align Alignment) = 0;
+  virtual bool forceScalarizeMaskedScatter(VectorType *DataType,
+                                           Align Alignment) = 0;
   virtual bool isLegalMaskedCompressStore(Type *DataType) = 0;
   virtual bool isLegalMaskedExpandLoad(Type *DataType) = 0;
   virtual bool enableOrderedReductions() = 0;
   virtual bool hasDivRemOp(Type *DataType, bool IsSigned) = 0;
   virtual bool hasVolatileVariant(Instruction *I, unsigned AddrSpace) = 0;
   virtual bool prefersVectorizedAddressing() = 0;
   virtual InstructionCost getScalingFactorCost(Type *Ty, GlobalValue *BaseGV,
                                                int64_t BaseOffset,
(387 lines hidden) bool isLegalNTLoad(Type *DataType, Align Alignment) override {
     return Impl.isLegalNTLoad(DataType, Alignment);
   }
   bool isLegalMaskedScatter(Type *DataType, Align Alignment) override {
     return Impl.isLegalMaskedScatter(DataType, Alignment);
   }
   bool isLegalMaskedGather(Type *DataType, Align Alignment) override {
     return Impl.isLegalMaskedGather(DataType, Alignment);
   }
+  bool forceScalarizeMaskedGather(VectorType *DataType,
+                                  Align Alignment) override {
+    return Impl.forceScalarizeMaskedGather(DataType, Alignment);
+  }
+  bool forceScalarizeMaskedScatter(VectorType *DataType,
+                                   Align Alignment) override {
+    return Impl.forceScalarizeMaskedScatter(DataType, Alignment);
+  }
   bool isLegalMaskedCompressStore(Type *DataType) override {
     return Impl.isLegalMaskedCompressStore(DataType);
   }
   bool isLegalMaskedExpandLoad(Type *DataType) override {
     return Impl.isLegalMaskedExpandLoad(DataType);
   }
   bool enableOrderedReductions() override {
     return Impl.enableOrderedReductions();
(527 lines hidden)

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

(261 lines hidden) public:
   bool isLegalMaskedScatter(Type *DataType, Align Alignment) const {
     return false;
   }

   bool isLegalMaskedGather(Type *DataType, Align Alignment) const {
     return false;
   }

+  bool forceScalarizeMaskedGather(VectorType *DataType, Align Alignment) const {
+    return false;
+  }
+
+  bool forceScalarizeMaskedScatter(VectorType *DataType,
+                                   Align Alignment) const {
+    return false;
+  }
+
   bool isLegalMaskedCompressStore(Type *DataType) const { return false; }

   bool isLegalMaskedExpandLoad(Type *DataType) const { return false; }

   bool enableOrderedReductions() const { return false; }

   bool hasDivRemOp(Type *DataType, bool IsSigned) const { return false; }

(945 lines hidden)

llvm/lib/Analysis/TargetTransformInfo.cpp

(402 lines hidden) bool TargetTransformInfo::isLegalMaskedGather(Type *DataType,
   return TTIImpl->isLegalMaskedGather(DataType, Alignment);
 }

 bool TargetTransformInfo::isLegalMaskedScatter(Type *DataType,
                                                Align Alignment) const {
   return TTIImpl->isLegalMaskedScatter(DataType, Alignment);
 }

+bool TargetTransformInfo::forceScalarizeMaskedGather(VectorType *DataType,
+                                                     Align Alignment) const {
+  return TTIImpl->forceScalarizeMaskedGather(DataType, Alignment);
+}
+
+bool TargetTransformInfo::forceScalarizeMaskedScatter(VectorType *DataType,
+                                                      Align Alignment) const {
+  return TTIImpl->forceScalarizeMaskedScatter(DataType, Alignment);
+}
+
 bool TargetTransformInfo::isLegalMaskedCompressStore(Type *DataType) const {
   return TTIImpl->isLegalMaskedCompressStore(DataType);
 }

 bool TargetTransformInfo::isLegalMaskedExpandLoad(Type *DataType) const {
   return TTIImpl->isLegalMaskedExpandLoad(DataType);
 }

(778 lines hidden)

llvm/lib/Target/ARM/ARMTargetTransformInfo.h

(183 lines hidden) public:
   bool isProfitableLSRChainElement(Instruction *I);

   bool isLegalMaskedLoad(Type *DataTy, Align Alignment);

   bool isLegalMaskedStore(Type *DataTy, Align Alignment) {
     return isLegalMaskedLoad(DataTy, Alignment);
   }

+  bool forceScalarizeMaskedGather(VectorType *VTy, Align Alignment) {
+    // For MVE, we have a custom lowering pass that will already have custom
+    // legalised any gathers that we can lower to MVE intrinsics, and want to
+    // expand all the rest. The pass runs before the masked intrinsic lowering
+    // pass.
+    return true;
+  }
+
+  bool forceScalarizeMaskedScatter(VectorType *VTy, Align Alignment) {
+    return forceScalarizeMaskedGather(VTy, Alignment);
+  }
+
   bool isLegalMaskedGather(Type *Ty, Align Alignment);

   bool isLegalMaskedScatter(Type *Ty, Align Alignment) {
     return isLegalMaskedGather(Ty, Alignment);
   }

   InstructionCost getMemcpyCost(const Instruction *I);

(136 lines hidden)

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp

(1,110 lines hidden) bool ARMTTIImpl::isLegalMaskedLoad(Type *DataTy, Align Alignment) {
   return (EltWidth == 32 && Alignment >= 4) ||
          (EltWidth == 16 && Alignment >= 2) || (EltWidth == 8);
 }

 bool ARMTTIImpl::isLegalMaskedGather(Type *Ty, Align Alignment) {
   if (!EnableMaskedGatherScatters || !ST->hasMVEIntegerOps())
     return false;

-  // This method is called in 2 places:
-  //  - from the vectorizer with a scalar type, in which case we need to get
-  //  this as good as we can with the limited info we have (and rely on the cost
-  //  model for the rest).
-  //  - from the masked intrinsic lowering pass with the actual vector type.
-  // For MVE, we have a custom lowering pass that will already have custom
-  // legalised any gathers that we can to MVE intrinsics, and want to expand all
-  // the rest. The pass runs before the masked intrinsic lowering pass, so if we
-  // are here, we know we want to expand.
-  if (isa<VectorType>(Ty))
-    return false;
-
   unsigned EltWidth = Ty->getScalarSizeInBits();
   return ((EltWidth == 32 && Alignment >= 4) ||
           (EltWidth == 16 && Alignment >= 2) || EltWidth == 8);
 }

 /// Given a memcpy/memset/memmove instruction, return the number of memory
 /// operations performed, via querying findOptimalMemOpLowering. Returns -1 if a
 /// call is used.
(1,220 lines hidden)

llvm/lib/Target/X86/X86TargetTransformInfo.h

(220 lines hidden) InstructionCost getIntImmCostIntrin(Intrinsic::ID IID, unsigned Idx,
                                       TTI::TargetCostKind CostKind);
   bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,
                      TargetTransformInfo::LSRCost &C2);
   bool canMacroFuseCmp();
   bool isLegalMaskedLoad(Type *DataType, Align Alignment);
   bool isLegalMaskedStore(Type *DataType, Align Alignment);
   bool isLegalNTLoad(Type *DataType, Align Alignment);
   bool isLegalNTStore(Type *DataType, Align Alignment);
+  bool forceScalarizeMaskedGather(VectorType *VTy, Align Alignment);
+  bool forceScalarizeMaskedScatter(VectorType *VTy, Align Alignment) {
+    return forceScalarizeMaskedGather(VTy, Alignment);
+  }
   bool isLegalMaskedGather(Type *DataType, Align Alignment);
   bool isLegalMaskedScatter(Type *DataType, Align Alignment);
   bool isLegalMaskedExpandLoad(Type *DataType);
   bool isLegalMaskedCompressStore(Type *DataType);
   bool hasDivRemOp(Type *DataType, bool IsSigned);
   bool isFCmpOrdCheaperThanFCmpZero(Type *Ty);
   bool areInlineCompatible(const Function *Caller,
                            const Function *Callee) const;
(26 lines hidden)

llvm/lib/Target/X86/X86TargetTransformInfo.cpp

	Show First 20 Lines • Show All 4,989 Lines • ▼ Show 20 Lines

	/// Calculate the cost of Gather / Scatter operation			/// Calculate the cost of Gather / Scatter operation
	InstructionCost X86TTIImpl::getGatherScatterOpCost(			InstructionCost X86TTIImpl::getGatherScatterOpCost(
	unsigned Opcode, Type SrcVTy, const Value Ptr, bool VariableMask,			unsigned Opcode, Type SrcVTy, const Value Ptr, bool VariableMask,
	Align Alignment, TTI::TargetCostKind CostKind,			Align Alignment, TTI::TargetCostKind CostKind,
	const Instruction *I = nullptr) {			const Instruction *I = nullptr) {
	if (CostKind != TTI::TCK_RecipThroughput) {			if (CostKind != TTI::TCK_RecipThroughput) {
	if ((Opcode == Instruction::Load &&			if ((Opcode == Instruction::Load &&
	isLegalMaskedGather(SrcVTy, Align(Alignment))) \|\|			isLegalMaskedGather(SrcVTy, Align(Alignment)) &&
				!forceScalarizeMaskedGather(cast<VectorType>(SrcVTy),
				Align(Alignment))) \|\|
	(Opcode == Instruction::Store &&			(Opcode == Instruction::Store &&
	isLegalMaskedScatter(SrcVTy, Align(Alignment))))			isLegalMaskedScatter(SrcVTy, Align(Alignment)) &&
				!forceScalarizeMaskedScatter(cast<VectorType>(SrcVTy),
				Align(Alignment))))
	return 1;			return 1;
	return BaseT::getGatherScatterOpCost(Opcode, SrcVTy, Ptr, VariableMask,			return BaseT::getGatherScatterOpCost(Opcode, SrcVTy, Ptr, VariableMask,
	Alignment, CostKind, I);			Alignment, CostKind, I);
	}			}

	assert(SrcVTy->isVectorTy() && "Unexpected data type for Gather/Scatter");			assert(SrcVTy->isVectorTy() && "Unexpected data type for Gather/Scatter");
	PointerType *PtrTy = dyn_cast<PointerType>(Ptr->getType());			PointerType *PtrTy = dyn_cast<PointerType>(Ptr->getType());
	if (!PtrTy && Ptr->getType()->isVectorTy())			if (!PtrTy && Ptr->getType()->isVectorTy())
	PtrTy = dyn_cast<PointerType>(			PtrTy = dyn_cast<PointerType>(
	cast<VectorType>(Ptr->getType())->getElementType());			cast<VectorType>(Ptr->getType())->getElementType());
	assert(PtrTy && "Unexpected type for Ptr argument");			assert(PtrTy && "Unexpected type for Ptr argument");
	unsigned AddressSpace = PtrTy->getAddressSpace();			unsigned AddressSpace = PtrTy->getAddressSpace();

	if ((Opcode == Instruction::Load &&			if ((Opcode == Instruction::Load &&
	!isLegalMaskedGather(SrcVTy, Align(Alignment))) ||			(!isLegalMaskedGather(SrcVTy, Align(Alignment)) ||
				forceScalarizeMaskedGather(cast<VectorType>(SrcVTy),
				Align(Alignment)))) ||
	(Opcode == Instruction::Store &&			(Opcode == Instruction::Store &&
	!isLegalMaskedScatter(SrcVTy, Align(Alignment))))			(!isLegalMaskedScatter(SrcVTy, Align(Alignment)) ||
				forceScalarizeMaskedScatter(cast<VectorType>(SrcVTy),
				Align(Alignment)))))
	return getGSScalarCost(Opcode, SrcVTy, VariableMask, Alignment,			return getGSScalarCost(Opcode, SrcVTy, VariableMask, Alignment,
	AddressSpace);			AddressSpace);

	return getGSVectorCost(Opcode, SrcVTy, Ptr, Alignment, AddressSpace);			return getGSVectorCost(Opcode, SrcVTy, Ptr, Alignment, AddressSpace);
	}			}

	bool X86TTIImpl::isLSRCostLess(TargetTransformInfo::LSRCost &C1,			bool X86TTIImpl::isLSRCostLess(TargetTransformInfo::LSRCost &C1,
	TargetTransformInfo::LSRCost &C2) {			TargetTransformInfo::LSRCost &C2) {
	[106 lines not shown]

	bool X86TTIImpl::supportsGather() const {			bool X86TTIImpl::supportsGather() const {
	// Some CPUs have better gather performance than others.			// Some CPUs have better gather performance than others.
	// TODO: Remove the explicit ST->hasAVX512()?, That would mean we would only			// TODO: Remove the explicit ST->hasAVX512()?, That would mean we would only
	// enable gather with a -march.			// enable gather with a -march.
	return ST->hasAVX512() || (ST->hasFastGather() && ST->hasAVX2());			return ST->hasAVX512() || (ST->hasFastGather() && ST->hasAVX2());
	}			}

				bool X86TTIImpl::forceScalarizeMaskedGather(VectorType *VTy, Align Alignment) {
				// Gather / Scatter for vector 2 is not profitable on KNL / SKX
				// Vector-4 of gather/scatter instruction does not exist on KNL. We can extend
				// it to 8 elements, but zeroing upper bits of the mask vector will add more
				// instructions. Right now we give the scalar cost of vector-4 for KNL. TODO:
				// Check, maybe the gather/scatter instruction is better in the VariableMask
				// case.
				unsigned NumElts = cast<FixedVectorType>(VTy)->getNumElements();
				sdesmalen (inline comment): This is casting to `FixedVectorType`, so I think we should make the interface take `VectorType` instead of `Type`.
				return NumElts == 1 ||
				(ST->hasAVX512() && (NumElts == 2 || (NumElts == 4 && !ST->hasVLX())));
				}
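
For anyone who wants to poke at the new predicate outside of LLVM, the element-count check in `forceScalarizeMaskedGather` can be sketched as a standalone function. This is only an illustration: `SubtargetSketch` and `forceScalarizeSketch` are hypothetical names standing in for `X86Subtarget` and the TTI hook, and only the boolean logic is taken from the right-hand side of the diff.

```cpp
#include <cassert>

// Hypothetical stand-in for the two X86Subtarget queries the hook uses.
struct SubtargetSketch {
  bool HasAVX512;
  bool HasVLX;
};

// Returns true when a gather/scatter with NumElts lanes should be scalarized,
// following the same condition as the patch.
inline bool forceScalarizeSketch(const SubtargetSketch &ST, unsigned NumElts) {
  assert(NumElts > 0 && "a fixed-width vector has at least one element");
  // Single-element vectors are always scalarized.
  if (NumElts == 1)
    return true;
  // With AVX-512, 2-element forms are treated as unprofitable and 4-element
  // forms are only kept when VLX is available.
  return ST.HasAVX512 && (NumElts == 2 || (NumElts == 4 && !ST.HasVLX));
}
```

With this sketch, a 4-element query on an AVX-512 target without VLX is scalarized, while an 8-element query is not.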

	bool X86TTIImpl::isLegalMaskedGather(Type *DataTy, Align Alignment) {			bool X86TTIImpl::isLegalMaskedGather(Type *DataTy, Align Alignment) {
	if (!supportsGather())			if (!supportsGather())
	return false;			return false;

	// This function is called now in two cases: from the Loop Vectorizer
	// and from the Scalarizer.
	// When the Loop Vectorizer asks about legality of the feature,
	// the vectorization factor is not calculated yet. The Loop Vectorizer
	// sends a scalar type and the decision is based on the width of the
	// scalar element.
	// Later on, the cost model will estimate usage this intrinsic based on
	// the vector type.
	// The Scalarizer asks again about legality. It sends a vector type.
	// In this case we can reject non-power-of-2 vectors.
	// We also reject single element vectors as the type legalizer can't
	// scalarize it.
	if (auto *DataVTy = dyn_cast<FixedVectorType>(DataTy)) {
	unsigned NumElts = DataVTy->getNumElements();
	if (NumElts == 1)
	return false;
	// Gather / Scatter for vector 2 is not profitable on KNL / SKX
	// Vector-4 of gather/scatter instruction does not exist on KNL.
	// We can extend it to 8 elements, but zeroing upper bits of
	// the mask vector will add more instructions. Right now we give the scalar
	// cost of vector-4 for KNL. TODO: Check, maybe the gather/scatter
	// instruction is better in the VariableMask case.
	if (ST->hasAVX512() && (NumElts == 2 || (NumElts == 4 && !ST->hasVLX())))
	return false;
	}
	Type *ScalarTy = DataTy->getScalarType();			Type *ScalarTy = DataTy->getScalarType();
				lebedev.ri (inline comment): Can you move this code into `forceScalarizeMaskedGather()`? It should not have been here; LV should rely on the cost model.
	if (ScalarTy->isPointerTy())			if (ScalarTy->isPointerTy())
	return true;			return true;

	if (ScalarTy->isFloatTy() || ScalarTy->isDoubleTy())			if (ScalarTy->isFloatTy() || ScalarTy->isDoubleTy())
	return true;			return true;

	if (!ScalarTy->isIntegerTy())			if (!ScalarTy->isIntegerTy())
	return false;			return false;
	[607 lines not shown]

llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp

[953 lines not shown]	case Intrinsic::masked_store:
scalarizeMaskedStore(DL, CI, DTU, ModifiedDT);		scalarizeMaskedStore(DL, CI, DTU, ModifiedDT);
return true;		return true;
case Intrinsic::masked_gather: {		case Intrinsic::masked_gather: {
MaybeAlign MA =		MaybeAlign MA =
cast<ConstantInt>(CI->getArgOperand(1))->getMaybeAlignValue();		cast<ConstantInt>(CI->getArgOperand(1))->getMaybeAlignValue();
Type *LoadTy = CI->getType();		Type *LoadTy = CI->getType();
Align Alignment = DL.getValueOrABITypeAlignment(MA,		Align Alignment = DL.getValueOrABITypeAlignment(MA,
LoadTy->getScalarType());		LoadTy->getScalarType());
if (TTI.isLegalMaskedGather(LoadTy, Alignment))		if (TTI.isLegalMaskedGather(LoadTy, Alignment) &&
		!TTI.forceScalarizeMaskedGather(cast<VectorType>(LoadTy), Alignment))
return false;		return false;
scalarizeMaskedGather(DL, CI, DTU, ModifiedDT);		scalarizeMaskedGather(DL, CI, DTU, ModifiedDT);
return true;		return true;
}		}
case Intrinsic::masked_scatter: {		case Intrinsic::masked_scatter: {
MaybeAlign MA =		MaybeAlign MA =
cast<ConstantInt>(CI->getArgOperand(2))->getMaybeAlignValue();		cast<ConstantInt>(CI->getArgOperand(2))->getMaybeAlignValue();
Type *StoreTy = CI->getArgOperand(0)->getType();		Type *StoreTy = CI->getArgOperand(0)->getType();
Align Alignment = DL.getValueOrABITypeAlignment(MA,		Align Alignment = DL.getValueOrABITypeAlignment(MA,
StoreTy->getScalarType());		StoreTy->getScalarType());
if (TTI.isLegalMaskedScatter(StoreTy, Alignment))		if (TTI.isLegalMaskedScatter(StoreTy, Alignment) &&
		!TTI.forceScalarizeMaskedScatter(cast<VectorType>(StoreTy),
		Alignment))
return false;		return false;
scalarizeMaskedScatter(DL, CI, DTU, ModifiedDT);		scalarizeMaskedScatter(DL, CI, DTU, ModifiedDT);
return true;		return true;
}		}
case Intrinsic::masked_expandload:		case Intrinsic::masked_expandload:
if (TTI.isLegalMaskedExpandLoad(CI->getType()))		if (TTI.isLegalMaskedExpandLoad(CI->getType()))
return false;		return false;
scalarizeMaskedExpandLoad(DL, CI, DTU, ModifiedDT);		scalarizeMaskedExpandLoad(DL, CI, DTU, ModifiedDT);
[11 lines not shown]
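
The call-site pattern used in the gather and scatter cases above (keep the intrinsic only when it is legal and the target does not force scalarization) can be written down as a tiny standalone sketch. The `Lowering` enum and `chooseLowering` function are hypothetical names introduced for illustration; the pass itself simply returns `false` or calls the scalarize helper.

```cpp
// Hypothetical result type: keep the masked intrinsic or expand it to scalars.
enum class Lowering { Native, Scalarized };

// Mirrors `isLegalMaskedGather(...) && !forceScalarizeMaskedGather(...)`:
// the intrinsic is kept only when both conditions allow it.
inline Lowering chooseLowering(bool IsLegal, bool ForceScalarize) {
  return (IsLegal && !ForceScalarize) ? Lowering::Native
                                      : Lowering::Scalarized;
}
```

The point of splitting the two queries is that `ForceScalarize` can veto a form that is otherwise legal, without the legality hook having to guess who is calling it.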

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp


[1,537 lines not shown]	public:
/// for the given \p DataType and kind of access to \p Ptr.		/// for the given \p DataType and kind of access to \p Ptr.
bool isLegalMaskedLoad(Type *DataType, Value *Ptr, Align Alignment) const {		bool isLegalMaskedLoad(Type *DataType, Value *Ptr, Align Alignment) const {
return Legal->isConsecutivePtr(DataType, Ptr) &&		return Legal->isConsecutivePtr(DataType, Ptr) &&
TTI.isLegalMaskedLoad(DataType, Alignment);		TTI.isLegalMaskedLoad(DataType, Alignment);
}		}

/// Returns true if the target machine can represent \p V as a masked gather		/// Returns true if the target machine can represent \p V as a masked gather
/// or scatter operation.		/// or scatter operation.
bool isLegalGatherOrScatter(Value *V) {		bool isLegalGatherOrScatter(Value *V,
		ElementCount VF = ElementCount::getFixed(1)) {
bool LI = isa<LoadInst>(V);		bool LI = isa<LoadInst>(V);
bool SI = isa<StoreInst>(V);		bool SI = isa<StoreInst>(V);
if (!LI && !SI)		if (!LI && !SI)
return false;		return false;
auto *Ty = getLoadStoreType(V);		auto *Ty = getLoadStoreType(V);
Align Align = getLoadStoreAlignment(V);		Align Align = getLoadStoreAlignment(V);
		if (VF.isVector())
		Ty = VectorType::get(Ty, VF);
return (LI && TTI.isLegalMaskedGather(Ty, Align)) ||		return (LI && TTI.isLegalMaskedGather(Ty, Align)) ||
		david-arm (inline comment): According to the default `VF = ElementCount::getFixed(1)` it looks like we can still have an inconsistency during vectorisation where we get different answers? I think it's also called from `collectElementTypesForWidening`, so we could potentially say a gather is legal when the type is scalar, but then say it's illegal in `setCostBasedWideningDecision`. I wonder if it's worth adding a comment there saying something like:
		// We're simply querying at this point if the target even supports any vector
		// gathers and scatters for the given element type? Certain vector forms may
		// still be illegal, but setCostBasedWideningDecision will distinguish between
		// the legal behaviour for different VFs at that point and generate costs accordingly.
		Also, does it matter that `LoopVectorizationCostModel::isScalarWithPredication` is still passing in an element type rather than a vector? For example, `collectLoopUniforms` calls `isScalarWithPredication` and at that point knows the VF. It may be fine, but I'm just a bit worried we might be swapping one inconsistency for another that's all.
		sdesmalen (inline comment): The code in `collectElementTypesForWidening` seems like a hack to get the LV to choose a wider VF and it seems specific to X86. If I replace that whole expression with `if (!T->isSized()) continue`, I only get a single test that fails because it explicitly tests for it to use a wider VF. I doubt this is the right mechanism to use for that kind of purpose going forward. In any case, if `collectElementTypesForWidening` would consider using a wider VF, it's still the cost-model that will then narrow that down (after this patch), because passing in the VF will result in a more restricted answer for `isLegalMaskedGather`.
		> Also, does it matter that LoopVectorizationCostModel::isScalarWithPredication is still passing in an element type rather than a vector? For example, collectLoopUniforms calls isScalarWithPredication and at that point knows the VF. It may be fine, but I'm just a bit worried we might be swapping one inconsistency for another that's all.
		I don't know if there's an actual problem atm, but I agree it makes sense to pass it to isScalarWithPredication as well. It should be possible since VF is available in all functions that call it.
(SI && TTI.isLegalMaskedScatter(Ty, Align));		(SI && TTI.isLegalMaskedScatter(Ty, Align));
}		}

/// Returns true if the target machine supports all of the reduction		/// Returns true if the target machine supports all of the reduction
/// variables found for the given VF.		/// variables found for the given VF.
bool canVectorizeReductions(ElementCount VF) const {		bool canVectorizeReductions(ElementCount VF) const {
return (all_of(Legal->getReductionVars(), [&](auto &Reduction) -> bool {		return (all_of(Legal->getReductionVars(), [&](auto &Reduction) -> bool {
const RecurrenceDescriptor &RdxDesc = Reduction.second;		const RecurrenceDescriptor &RdxDesc = Reduction.second;
return TTI.isLegalToVectorizeReduction(RdxDesc, VF);		return TTI.isLegalToVectorizeReduction(RdxDesc, VF);
}));		}));
}		}

/// Returns true if \p I is an instruction that will be scalarized with		/// Returns true if \p I is an instruction that will be scalarized with
/// predication. Such instructions include conditional stores and		/// predication when vectorizing \p I with vectorization factor \p VF. Such
/// instructions that may divide by zero.		/// instructions include conditional stores and instructions that may divide
/// If a non-zero VF has been calculated, we check if I will be scalarized		/// by zero.
/// predication for that VF.		bool isScalarWithPredication(Instruction *I, ElementCount VF) const;
		fhahn (inline comment): documentation needs to be updated.
bool isScalarWithPredication(Instruction *I) const;

// Returns true if \p I is an instruction that will be predicated either		// Returns true if \p I is an instruction that will be predicated either
// through scalar predication or masked load/store or masked gather/scatter.		// through scalar predication or masked load/store or masked gather/scatter.
		// \p VF is the vectorization factor that will be used to vectorize \p I.
// Superset of instructions that return true for isScalarWithPredication.		// Superset of instructions that return true for isScalarWithPredication.
bool isPredicatedInst(Instruction *I, bool IsKnownUniform = false) {		bool isPredicatedInst(Instruction *I, ElementCount VF,
		bool IsKnownUniform = false) {
		fhahn (inline comment): VF should be documented.
// When we know the load is uniform and the original scalar loop was not		// When we know the load is uniform and the original scalar loop was not
// predicated we don't need to mark it as a predicated instruction. Any		// predicated we don't need to mark it as a predicated instruction. Any
// vectorised blocks created when tail-folding are something artificial we		// vectorised blocks created when tail-folding are something artificial we
// have introduced and we know there is always at least one active lane.		// have introduced and we know there is always at least one active lane.
// That's why we call Legal->blockNeedsPredication here because it doesn't		// That's why we call Legal->blockNeedsPredication here because it doesn't
// query tail-folding.		// query tail-folding.
if (IsKnownUniform && isa<LoadInst>(I) &&		if (IsKnownUniform && isa<LoadInst>(I) &&
!Legal->blockNeedsPredication(I->getParent()))		!Legal->blockNeedsPredication(I->getParent()))
return false;		return false;
if (!blockNeedsPredicationForAnyReason(I->getParent()))		if (!blockNeedsPredicationForAnyReason(I->getParent()))
return false;		return false;
// Loads and stores that need some form of masked operation are predicated		// Loads and stores that need some form of masked operation are predicated
// instructions.		// instructions.
if (isa<LoadInst>(I) || isa<StoreInst>(I))		if (isa<LoadInst>(I) || isa<StoreInst>(I))
return Legal->isMaskRequired(I);		return Legal->isMaskRequired(I);
return isScalarWithPredication(I);		return isScalarWithPredication(I, VF);
}		}

/// Returns true if \p I is a memory instruction with consecutive memory		/// Returns true if \p I is a memory instruction with consecutive memory
/// access that can be widened.		/// access that can be widened.
bool		bool
memoryInstructionCanBeWidened(Instruction *I,		memoryInstructionCanBeWidened(Instruction *I,
ElementCount VF = ElementCount::getFixed(1));		ElementCount VF = ElementCount::getFixed(1));

[175 lines not shown]	InstructionCost getScalarizationOverhead(Instruction *I,
ElementCount VF) const;		ElementCount VF) const;

/// Returns whether the instruction is a load or store and will be a emitted		/// Returns whether the instruction is a load or store and will be a emitted
/// as a vector operation.		/// as a vector operation.
bool isConsecutiveLoadOrStore(Instruction *I);		bool isConsecutiveLoadOrStore(Instruction *I);

/// Returns true if an artificially high cost for emulated masked memrefs		/// Returns true if an artificially high cost for emulated masked memrefs
/// should be used.		/// should be used.
bool useEmulatedMaskMemRefHack(Instruction *I);		bool useEmulatedMaskMemRefHack(Instruction *I, ElementCount VF);

/// Map of scalar integer values to the smallest bitwidth they can be legally		/// Map of scalar integer values to the smallest bitwidth they can be legally
/// represented as. The vector equivalents of these values should be truncated		/// represented as. The vector equivalents of these values should be truncated
/// to this type.		/// to this type.
MapVector<Instruction *, uint64_t> MinBWs;		MapVector<Instruction *, uint64_t> MinBWs;

/// A type representing the costs for instructions if they were to be		/// A type representing the costs for instructions if they were to be
/// scalarized rather than vectorized. The entries are Instruction-Cost		/// scalarized rather than vectorized. The entries are Instruction-Cost
[3,066 lines not shown]	for (auto &Induction : Legal->getInductionVars()) {
LLVM_DEBUG(dbgs() << "LV: Found scalar instruction: " << *Ind << "\n");		LLVM_DEBUG(dbgs() << "LV: Found scalar instruction: " << *Ind << "\n");
LLVM_DEBUG(dbgs() << "LV: Found scalar instruction: " << *IndUpdate		LLVM_DEBUG(dbgs() << "LV: Found scalar instruction: " << *IndUpdate
<< "\n");		<< "\n");
}		}

Scalars[VF].insert(Worklist.begin(), Worklist.end());		Scalars[VF].insert(Worklist.begin(), Worklist.end());
}		}

bool LoopVectorizationCostModel::isScalarWithPredication(Instruction *I) const {		bool LoopVectorizationCostModel::isScalarWithPredication(
		Instruction *I, ElementCount VF) const {
if (!blockNeedsPredicationForAnyReason(I->getParent()))		if (!blockNeedsPredicationForAnyReason(I->getParent()))
return false;		return false;
switch(I->getOpcode()) {		switch(I->getOpcode()) {
default:		default:
break;		break;
case Instruction::Load:		case Instruction::Load:
case Instruction::Store: {		case Instruction::Store: {
if (!Legal->isMaskRequired(I))		if (!Legal->isMaskRequired(I))
return false;		return false;
auto *Ptr = getLoadStorePointerOperand(I);		auto *Ptr = getLoadStorePointerOperand(I);
auto *Ty = getLoadStoreType(I);		auto *Ty = getLoadStoreType(I);
		Type *VTy = Ty;
		if (VF.isVector())
		VTy = VectorType::get(Ty, VF);
const Align Alignment = getLoadStoreAlignment(I);		const Align Alignment = getLoadStoreAlignment(I);
return isa<LoadInst>(I) ? !(isLegalMaskedLoad(Ty, Ptr, Alignment) ||		return isa<LoadInst>(I) ? !(isLegalMaskedLoad(Ty, Ptr, Alignment) ||
TTI.isLegalMaskedGather(Ty, Alignment))		TTI.isLegalMaskedGather(VTy, Alignment))
: !(isLegalMaskedStore(Ty, Ptr, Alignment) ||		: !(isLegalMaskedStore(Ty, Ptr, Alignment) ||
TTI.isLegalMaskedScatter(Ty, Alignment));		TTI.isLegalMaskedScatter(VTy, Alignment));
}		}
case Instruction::UDiv:		case Instruction::UDiv:
case Instruction::SDiv:		case Instruction::SDiv:
case Instruction::SRem:		case Instruction::SRem:
case Instruction::URem:		case Instruction::URem:
return mayDivideByZero(*I);		return mayDivideByZero(*I);
}		}
return false;		return false;
[56 lines not shown]	bool LoopVectorizationCostModel::memoryInstructionCanBeWidened(
auto *ScalarTy = getLoadStoreType(I);		auto *ScalarTy = getLoadStoreType(I);

// In order to be widened, the pointer should be consecutive, first of all.		// In order to be widened, the pointer should be consecutive, first of all.
if (!Legal->isConsecutivePtr(ScalarTy, Ptr))		if (!Legal->isConsecutivePtr(ScalarTy, Ptr))
return false;		return false;

// If the instruction is a store located in a predicated block, it will be		// If the instruction is a store located in a predicated block, it will be
// scalarized.		// scalarized.
if (isScalarWithPredication(I))		if (isScalarWithPredication(I, VF))
return false;		return false;

// If the instruction's allocated size doesn't equal it's type size, it		// If the instruction's allocated size doesn't equal it's type size, it
// requires padding and will be scalarized.		// requires padding and will be scalarized.
auto &DL = I->getModule()->getDataLayout();		auto &DL = I->getModule()->getDataLayout();
if (hasIrregularType(ScalarTy, DL))		if (hasIrregularType(ScalarTy, DL))
return false;		return false;

[34 lines not shown]	void LoopVectorizationCostModel::collectLoopUniforms(ElementCount VF) {
// where only a single instance out of VF should be formed.		// where only a single instance out of VF should be formed.
// TODO: optimize such seldom cases if found important, see PR40816.		// TODO: optimize such seldom cases if found important, see PR40816.
auto addToWorklistIfAllowed = [&](Instruction *I) -> void {		auto addToWorklistIfAllowed = [&](Instruction *I) -> void {
if (isOutOfScope(I)) {		if (isOutOfScope(I)) {
LLVM_DEBUG(dbgs() << "LV: Found not uniform due to scope: "		LLVM_DEBUG(dbgs() << "LV: Found not uniform due to scope: "
<< *I << "\n");		<< *I << "\n");
return;		return;
}		}
if (isScalarWithPredication(I)) {		if (isScalarWithPredication(I, VF)) {
LLVM_DEBUG(dbgs() << "LV: Found not uniform being ScalarWithPredication: "		LLVM_DEBUG(dbgs() << "LV: Found not uniform being ScalarWithPredication: "
<< *I << "\n");		<< *I << "\n");
return;		return;
}		}
LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *I << "\n");		LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *I << "\n");
Worklist.insert(I);		Worklist.insert(I);
};		};

[1,412 lines not shown]	for (unsigned i = 0, e = VFs.size(); i < e; ++i) {
RU.LoopInvariantRegs = Invariant;		RU.LoopInvariantRegs = Invariant;
RU.MaxLocalUsers = MaxUsages[i];		RU.MaxLocalUsers = MaxUsages[i];
RUs[i] = RU;		RUs[i] = RU;
}		}

return RUs;		return RUs;
}		}

bool LoopVectorizationCostModel::useEmulatedMaskMemRefHack(Instruction *I){		bool LoopVectorizationCostModel::useEmulatedMaskMemRefHack(Instruction *I,
		ElementCount VF) {
// TODO: Cost model for emulated masked load/store is completely		// TODO: Cost model for emulated masked load/store is completely
// broken. This hack guides the cost model to use an artificially		// broken. This hack guides the cost model to use an artificially
// high enough value to practically disable vectorization with such		// high enough value to practically disable vectorization with such
// operations, except where previously deployed legality hack allowed		// operations, except where previously deployed legality hack allowed
// using very low cost values. This is to avoid regressions coming simply		// using very low cost values. This is to avoid regressions coming simply
// from moving "masked load/store" check from legality to cost model.		// from moving "masked load/store" check from legality to cost model.
// Masked Load/Gather emulation was previously never allowed.		// Masked Load/Gather emulation was previously never allowed.
// Limited number of Masked Store/Scatter emulation was allowed.		// Limited number of Masked Store/Scatter emulation was allowed.
assert(isPredicatedInst(I) &&		assert(isPredicatedInst(I, VF) && "Expecting a scalar emulated instruction");
"Expecting a scalar emulated instruction");
return isa<LoadInst>(I) ||		return isa<LoadInst>(I) ||
(isa<StoreInst>(I) &&		(isa<StoreInst>(I) &&
NumPredStores > NumberOfStoresToPredicate);		NumPredStores > NumberOfStoresToPredicate);
}		}

void LoopVectorizationCostModel::collectInstsToScalarize(ElementCount VF) {		void LoopVectorizationCostModel::collectInstsToScalarize(ElementCount VF) {
// If we aren't vectorizing the loop, or if we've already collected the		// If we aren't vectorizing the loop, or if we've already collected the
// instructions to scalarize, there's nothing to do. Collection may already		// instructions to scalarize, there's nothing to do. Collection may already
[10 lines not shown]	void LoopVectorizationCostModel::collectInstsToScalarize(ElementCount VF) {

// Find all the instructions that are scalar with predication in the loop and		// Find all the instructions that are scalar with predication in the loop and
// determine if it would be better to not if-convert the blocks they are in.		// determine if it would be better to not if-convert the blocks they are in.
// If so, we also record the instructions to scalarize.		// If so, we also record the instructions to scalarize.
for (BasicBlock *BB : TheLoop->blocks()) {		for (BasicBlock *BB : TheLoop->blocks()) {
if (!blockNeedsPredicationForAnyReason(BB))		if (!blockNeedsPredicationForAnyReason(BB))
continue;		continue;
for (Instruction &I : *BB)		for (Instruction &I : *BB)
if (isScalarWithPredication(&I)) {		if (isScalarWithPredication(&I, VF)) {
ScalarCostsTy ScalarCosts;		ScalarCostsTy ScalarCosts;
// Do not apply discount if scalable, because that would lead to		// Do not apply discount if scalable, because that would lead to
// invalid scalarization costs.		// invalid scalarization costs.
// Do not apply discount logic if hacked cost is needed		// Do not apply discount logic if hacked cost is needed
// for emulated masked memrefs.		// for emulated masked memrefs.
if (!VF.isScalable() && !useEmulatedMaskMemRefHack(&I) &&		if (!VF.isScalable() && !useEmulatedMaskMemRefHack(&I, VF) &&
computePredInstDiscount(&I, ScalarCosts, VF) >= 0)		computePredInstDiscount(&I, ScalarCosts, VF) >= 0)
ScalarCostsVF.insert(ScalarCosts.begin(), ScalarCosts.end());		ScalarCostsVF.insert(ScalarCosts.begin(), ScalarCosts.end());
// Remember that BB will remain after vectorization.		// Remember that BB will remain after vectorization.
PredicatedBBsAfterVectorization.insert(BB);		PredicatedBBsAfterVectorization.insert(BB);
}		}
}		}
}		}

[19 lines not shown]	auto canBeScalarized = [&](Instruction *I) -> bool {
// already be scalar to avoid traversing chains that are unlikely to be		// already be scalar to avoid traversing chains that are unlikely to be
// beneficial.		// beneficial.
if (!I->hasOneUse() || PredInst->getParent() != I->getParent() ||		if (!I->hasOneUse() || PredInst->getParent() != I->getParent() ||
isScalarAfterVectorization(I, VF))		isScalarAfterVectorization(I, VF))
return false;		return false;

// If the instruction is scalar with predication, it will be analyzed		// If the instruction is scalar with predication, it will be analyzed
// separately. We ignore it within the context of PredInst.		// separately. We ignore it within the context of PredInst.
if (isScalarWithPredication(I))		if (isScalarWithPredication(I, VF))
return false;		return false;

// If any of the instruction's operands are uniform after vectorization,		// If any of the instruction's operands are uniform after vectorization,
// the instruction cannot be scalarized. This prevents, for example, a		// the instruction cannot be scalarized. This prevents, for example, a
// masked load from being scalarized.		// masked load from being scalarized.
//		//
// We assume we will only emit a value for lane zero of an instruction		// We assume we will only emit a value for lane zero of an instruction
// marked uniform after vectorization, rather than VF identical values.		// marked uniform after vectorization, rather than VF identical values.
[30 lines not shown]	while (!Worklist.empty()) {
// predicated block. We will scale this cost by block probability after		// predicated block. We will scale this cost by block probability after
// computing the scalarization overhead.		// computing the scalarization overhead.
InstructionCost ScalarCost =		InstructionCost ScalarCost =
VF.getFixedValue() *		VF.getFixedValue() *
getInstructionCost(I, ElementCount::getFixed(1)).first;		getInstructionCost(I, ElementCount::getFixed(1)).first;

// Compute the scalarization overhead of needed insertelement instructions		// Compute the scalarization overhead of needed insertelement instructions
// and phi nodes.		// and phi nodes.
if (isScalarWithPredication(I) && !I->getType()->isVoidTy()) {		if (isScalarWithPredication(I, VF) && !I->getType()->isVoidTy()) {
ScalarCost += TTI.getScalarizationOverhead(		ScalarCost += TTI.getScalarizationOverhead(
cast<VectorType>(ToVectorTy(I->getType(), VF)),		cast<VectorType>(ToVectorTy(I->getType(), VF)),
APInt::getAllOnes(VF.getFixedValue()), true, false);		APInt::getAllOnes(VF.getFixedValue()), true, false);
ScalarCost +=		ScalarCost +=
VF.getFixedValue() *		VF.getFixedValue() *
TTI.getCFInstrCost(Instruction::PHI, TTI::TCK_RecipThroughput);		TTI.getCFInstrCost(Instruction::PHI, TTI::TCK_RecipThroughput);
}		}

[146 lines not shown]	LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,

// Get the overhead of the extractelement and insertelement instructions		// Get the overhead of the extractelement and insertelement instructions
// we might create due to scalarization.		// we might create due to scalarization.
Cost += getScalarizationOverhead(I, VF);		Cost += getScalarizationOverhead(I, VF);

// If we have a predicated load/store, it will need extra i1 extracts and		// If we have a predicated load/store, it will need extra i1 extracts and
// conditional branches, but may not be executed for each vector lane. Scale		// conditional branches, but may not be executed for each vector lane. Scale
// the cost by the probability of executing the predicated block.		// the cost by the probability of executing the predicated block.
if (isPredicatedInst(I)) {		if (isPredicatedInst(I, VF)) {
Cost /= getReciprocalPredBlockProb();		Cost /= getReciprocalPredBlockProb();

// Add the cost of an i1 extract and a branch		// Add the cost of an i1 extract and a branch
auto *Vec_i1Ty =		auto *Vec_i1Ty =
VectorType::get(IntegerType::getInt1Ty(ValTy->getContext()), VF);		VectorType::get(IntegerType::getInt1Ty(ValTy->getContext()), VF);
Cost += TTI.getScalarizationOverhead(		Cost += TTI.getScalarizationOverhead(
Vec_i1Ty, APInt::getAllOnes(VF.getKnownMinValue()),		Vec_i1Ty, APInt::getAllOnes(VF.getKnownMinValue()),
/*Insert=*/false, /*Extract=*/true);		/*Insert=*/false, /*Extract=*/true);
Cost += TTI.getCFInstrCost(Instruction::Br, TTI::TCK_RecipThroughput);		Cost += TTI.getCFInstrCost(Instruction::Br, TTI::TCK_RecipThroughput);

if (useEmulatedMaskMemRefHack(I))		if (useEmulatedMaskMemRefHack(I, VF))
// Artificially setting to a high enough value to practically disable		// Artificially setting to a high enough value to practically disable
// vectorization with such operations.		// vectorization with such operations.
Cost = 3000000;		Cost = 3000000;
}		}

return Cost;		return Cost;
}		}
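
As a rough illustration of the predicated branch in `getMemInstScalarizationCost` above, the arithmetic can be sketched with plain integers. This is a simplified model under stated assumptions: the function and parameter names are hypothetical, `InstructionCost` is replaced by `std::uint64_t`, and the extract and branch costs are passed in rather than queried from TTI.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model of the cost adjustment for a predicated load/store.
// All names here are hypothetical; only the order of operations follows
// the code shown in the diff.
inline std::uint64_t predicatedScalarizationCostSketch(
    std::uint64_t Cost, unsigned ReciprocalPredBlockProb,
    std::uint64_t ExtractCost, std::uint64_t BranchCost,
    bool UseEmulatedMaskMemRefHack) {
  assert(ReciprocalPredBlockProb != 0 && "probability scale must be non-zero");
  // Scale by the probability of executing the predicated block.
  Cost /= ReciprocalPredBlockProb;
  // Add the cost of the i1 extracts and of the conditional branch.
  Cost += ExtractCost;
  Cost += BranchCost;
  // The hack overrides everything with an artificially high value.
  if (UseEmulatedMaskMemRefHack)
    Cost = 3000000;
  return Cost;
}
```

Because the hack value replaces the computed cost instead of being added to it, any change in what `useEmulatedMaskMemRefHack(I, VF)` returns for a given VF has an all-or-nothing effect on the result.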

[390 lines not shown]	for (Instruction &I : *BB) {
Value *Ptr = getLoadStorePointerOperand(&I);		Value *Ptr = getLoadStorePointerOperand(&I);
if (!Ptr)		if (!Ptr)
continue;		continue;

// TODO: We should generate better code and update the cost model for		// TODO: We should generate better code and update the cost model for
// predicated uniform stores. Today they are treated as any other		// predicated uniform stores. Today they are treated as any other
// predicated store (see added test cases in		// predicated store (see added test cases in
// invariant-store-vectorization.ll).		// invariant-store-vectorization.ll).
if (isa<StoreInst>(&I) && isScalarWithPredication(&I))		if (isa<StoreInst>(&I) && isScalarWithPredication(&I, VF))
NumPredStores++;		NumPredStores++;

if (Legal->isUniformMemOp(I)) {		if (Legal->isUniformMemOp(I)) {
// TODO: Avoid replicating loads and stores instead of		// TODO: Avoid replicating loads and stores instead of
// relying on instcombine to remove them.		// relying on instcombine to remove them.
// Load: Scalar load + broadcast		// Load: Scalar load + broadcast
// Store: Scalar store + isLoopInvariantStoreValue ? 0 : extract		// Store: Scalar store + isLoopInvariantStoreValue ? 0 : extract
InstructionCost Cost;		InstructionCost Cost;
if (isa<StoreInst>(&I) && VF.isScalable() &&		if (isa<StoreInst>(&I) && VF.isScalable() &&
isLegalGatherOrScatter(&I)) {		isLegalGatherOrScatter(&I, VF)) {
Cost = getGatherScatterCost(&I, VF);		Cost = getGatherScatterCost(&I, VF);
setWideningDecision(&I, VF, CM_GatherScatter, Cost);		setWideningDecision(&I, VF, CM_GatherScatter, Cost);
} else {		} else {
assert((isa<LoadInst>(&I) || !VF.isScalable()) &&		assert((isa<LoadInst>(&I) || !VF.isScalable()) &&
"Cannot yet scalarize uniform stores");		"Cannot yet scalarize uniform stores");
Cost = getUniformMemOpCost(&I, VF);		Cost = getUniformMemOpCost(&I, VF);
setWideningDecision(&I, VF, CM_Scalarize, Cost);		setWideningDecision(&I, VF, CM_Scalarize, Cost);
}		}
(25 lines not shown)	for (Instruction &I : *BB) {
continue;		continue;

NumAccesses = Group->getNumMembers();		NumAccesses = Group->getNumMembers();
if (interleavedAccessCanBeWidened(&I, VF))		if (interleavedAccessCanBeWidened(&I, VF))
InterleaveCost = getInterleaveGroupCost(&I, VF);		InterleaveCost = getInterleaveGroupCost(&I, VF);
}		}

InstructionCost GatherScatterCost =		InstructionCost GatherScatterCost =
isLegalGatherOrScatter(&I)		isLegalGatherOrScatter(&I, VF)
? getGatherScatterCost(&I, VF) * NumAccesses		? getGatherScatterCost(&I, VF) * NumAccesses
: InstructionCost::getInvalid();		: InstructionCost::getInvalid();

InstructionCost ScalarizationCost =		InstructionCost ScalarizationCost =
getMemInstScalarizationCost(&I, VF) * NumAccesses;		getMemInstScalarizationCost(&I, VF) * NumAccesses;

// Choose better solution for the current VF,		// Choose better solution for the current VF,
// write down this decision and use it during vectorization.		// write down this decision and use it during vectorization.
(186 lines not shown)	LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF,
case Instruction::UDiv:		case Instruction::UDiv:
case Instruction::SDiv:		case Instruction::SDiv:
case Instruction::URem:		case Instruction::URem:
case Instruction::SRem:		case Instruction::SRem:
// If we have a predicated instruction, it may not be executed for each		// If we have a predicated instruction, it may not be executed for each
// vector lane. Get the scalarization cost and scale this amount by the		// vector lane. Get the scalarization cost and scale this amount by the
// probability of executing the predicated block. If the instruction is not		// probability of executing the predicated block. If the instruction is not
// predicated, we fall through to the next case.		// predicated, we fall through to the next case.
if (VF.isVector() && isScalarWithPredication(I)) {		if (VF.isVector() && isScalarWithPredication(I, VF)) {
InstructionCost Cost = 0;		InstructionCost Cost = 0;

// These instructions have a non-void type, so account for the phi nodes		// These instructions have a non-void type, so account for the phi nodes
// that we will create. This cost is likely to be zero. The phi node		// that we will create. This cost is likely to be zero. The phi node
// cost, if any, should be scaled by the block probability because it		// cost, if any, should be scaled by the block probability because it
// models a copy at the end of each predicated block.		// models a copy at the end of each predicated block.
Cost += VF.getKnownMinValue() *		Cost += VF.getKnownMinValue() *
TTI.getCFInstrCost(Instruction::PHI, CostKind);		TTI.getCFInstrCost(Instruction::PHI, CostKind);
(1,156 lines not shown)	VPRecipeOrVPValueTy VPRecipeBuilder::tryToBlend(PHINode *Phi,
return toVPRecipeResult(new VPBlendRecipe(Phi, OperandsWithMask));		return toVPRecipeResult(new VPBlendRecipe(Phi, OperandsWithMask));
}		}

VPWidenCallRecipe *VPRecipeBuilder::tryToWidenCall(CallInst *CI,		VPWidenCallRecipe *VPRecipeBuilder::tryToWidenCall(CallInst *CI,
ArrayRef<VPValue *> Operands,		ArrayRef<VPValue *> Operands,
VFRange &Range) const {		VFRange &Range) const {

bool IsPredicated = LoopVectorizationPlanner::getDecisionAndClampRange(		bool IsPredicated = LoopVectorizationPlanner::getDecisionAndClampRange(
[this, CI](ElementCount VF) { return CM.isScalarWithPredication(CI); },		[this, CI](ElementCount VF) {
		return CM.isScalarWithPredication(CI, VF);
		},
Range);		Range);

if (IsPredicated)		if (IsPredicated)
return nullptr;		return nullptr;

Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);		Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);
if (ID && (ID == Intrinsic::assume || ID == Intrinsic::lifetime_end ||		if (ID && (ID == Intrinsic::assume || ID == Intrinsic::lifetime_end ||
ID == Intrinsic::lifetime_start || ID == Intrinsic::sideeffect ||		ID == Intrinsic::lifetime_start || ID == Intrinsic::sideeffect ||
(23 lines not shown)

bool VPRecipeBuilder::shouldWiden(Instruction *I, VFRange &Range) const {		bool VPRecipeBuilder::shouldWiden(Instruction *I, VFRange &Range) const {
assert(!isa<BranchInst>(I) && !isa<PHINode>(I) && !isa<LoadInst>(I) &&		assert(!isa<BranchInst>(I) && !isa<PHINode>(I) && !isa<LoadInst>(I) &&
!isa<StoreInst>(I) && "Instruction should have been handled earlier");		!isa<StoreInst>(I) && "Instruction should have been handled earlier");
// Instruction should be widened, unless it is scalar after vectorization,		// Instruction should be widened, unless it is scalar after vectorization,
// scalarization is profitable or it is predicated.		// scalarization is profitable or it is predicated.
auto WillScalarize = [this, I](ElementCount VF) -> bool {		auto WillScalarize = [this, I](ElementCount VF) -> bool {
return CM.isScalarAfterVectorization(I, VF) ||		return CM.isScalarAfterVectorization(I, VF) ||
CM.isProfitableToScalarize(I, VF) || CM.isScalarWithPredication(I);		CM.isProfitableToScalarize(I, VF) ||
		CM.isScalarWithPredication(I, VF);
};		};
return !LoopVectorizationPlanner::getDecisionAndClampRange(WillScalarize,		return !LoopVectorizationPlanner::getDecisionAndClampRange(WillScalarize,
Range);		Range);
}		}

VPWidenRecipe *VPRecipeBuilder::tryToWiden(Instruction *I,		VPWidenRecipe *VPRecipeBuilder::tryToWiden(Instruction *I,
ArrayRef<VPValue *> Operands) const {		ArrayRef<VPValue *> Operands) const {
auto IsVectorizableOpcode = [](unsigned Opcode) {		auto IsVectorizableOpcode = [](unsigned Opcode) {
(57 lines not shown)
VPBasicBlock *VPRecipeBuilder::handleReplication(		VPBasicBlock *VPRecipeBuilder::handleReplication(
Instruction *I, VFRange &Range, VPBasicBlock *VPBB,		Instruction *I, VFRange &Range, VPBasicBlock *VPBB,
VPlanPtr &Plan) {		VPlanPtr &Plan) {
bool IsUniform = LoopVectorizationPlanner::getDecisionAndClampRange(		bool IsUniform = LoopVectorizationPlanner::getDecisionAndClampRange(
[&](ElementCount VF) { return CM.isUniformAfterVectorization(I, VF); },		[&](ElementCount VF) { return CM.isUniformAfterVectorization(I, VF); },
Range);		Range);

bool IsPredicated = LoopVectorizationPlanner::getDecisionAndClampRange(		bool IsPredicated = LoopVectorizationPlanner::getDecisionAndClampRange(
[&](ElementCount VF) { return CM.isPredicatedInst(I, IsUniform); },		[&](ElementCount VF) { return CM.isPredicatedInst(I, VF, IsUniform); },
Range);		Range);

// Even if the instruction is not marked as uniform, there are certain		// Even if the instruction is not marked as uniform, there are certain
// intrinsic calls that can be effectively treated as such, so we check for		// intrinsic calls that can be effectively treated as such, so we check for
// them here. Conservatively, we only do this for scalable vectors, since		// them here. Conservatively, we only do this for scalable vectors, since
// for fixed-width VFs we can always fall back on full scalarization.		// for fixed-width VFs we can always fall back on full scalarization.
if (!IsUniform && Range.Start.isScalable() && isa<IntrinsicInst>(I)) {		if (!IsUniform && Range.Start.isScalable() && isa<IntrinsicInst>(I)) {
switch (cast<IntrinsicInst>(I)->getIntrinsicID()) {		switch (cast<IntrinsicInst>(I)->getIntrinsicID()) {
(2,035 lines not shown)

llvm/test/Transforms/LoopVectorize/AArch64/tail-fold-uniform-memops.ll

(84 lines not shown)	if.end: ; preds = %if.then, %for.body
%index.next = add nuw i64 %index, 1		%index.next = add nuw i64 %index, 1
%exitcond.not = icmp eq i64 %index.next, %n		%exitcond.not = icmp eq i64 %index.next, %n
br i1 %exitcond.not, label %for.end, label %for.body		br i1 %exitcond.not, label %for.end, label %for.body

for.end: ; preds = %for.inc, %entry		for.end: ; preds = %for.inc, %entry
ret void		ret void
}		}

attributes #0 = { "target-features"="+neon,+sve,+v8.1a" }		attributes #0 = { "target-features"="+neon,+sve,+v8.1a" vscale_range(2, 0) }

llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse-mask4.ll

(119 lines not shown)	if.then: ; preds = %for.body
store double %add, double* %arrayidx1, align 8		store double %add, double* %arrayidx1, align 8
br label %for.inc		br label %for.inc

for.inc: ; preds = %for.body, %if.then		for.inc: ; preds = %for.body, %if.then
%cmp = icmp sgt i64 %i.08.in, 1		%cmp = icmp sgt i64 %i.08.in, 1
br i1 %cmp, label %for.body, label %for.cond.cleanup, !llvm.loop !0		br i1 %cmp, label %for.body, label %for.cond.cleanup, !llvm.loop !0
}		}

attributes #0 = {"target-cpu"="generic" "target-features"="+neon,+sve"}		attributes #0 = {"target-cpu"="generic" "target-features"="+neon,+sve" vscale_range(2,0) }


!0 = distinct !{!0, !1, !2, !3, !4, !5}		!0 = distinct !{!0, !1, !2, !3, !4, !5}
!1 = !{!"llvm.loop.mustprogress"}		!1 = !{!"llvm.loop.mustprogress"}
!2 = !{!"llvm.loop.vectorize.width", i32 4}		!2 = !{!"llvm.loop.vectorize.width", i32 4}
!3 = !{!"llvm.loop.vectorize.scalable.enable", i1 false}		!3 = !{!"llvm.loop.vectorize.scalable.enable", i1 false}
!4 = !{!"llvm.loop.vectorize.enable", i1 true}		!4 = !{!"llvm.loop.vectorize.enable", i1 true}
!5 = !{!"llvm.loop.interleave.count", i32 2}		!5 = !{!"llvm.loop.interleave.count", i32 2}

llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+sse2 | FileCheck %s --check-prefixes=SSE			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+sse2 | FileCheck %s --check-prefixes=SSE
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx | FileCheck %s --check-prefixes=AVX			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx | FileCheck %s --check-prefixes=AVX
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx2 | FileCheck %s --check-prefixes=AVX			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx2 | FileCheck %s --check-prefixes=AVX
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512f | FileCheck %s --check-prefixes=AVX			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512f | FileCheck %s --check-prefixes=AVX512
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512vl | FileCheck %s --check-prefixes=AVX			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512vl | FileCheck %s --check-prefixes=AVX512


	@b = global [8 x i32] zeroinitializer, align 16			@b = global [8 x i32] zeroinitializer, align 16
	@a = global [8 x i32] zeroinitializer, align 16			@a = global [8 x i32] zeroinitializer, align 16

	define void @foo() {			define void @foo() {
	; SSE-LABEL: @foo(			; SSE-LABEL: @foo(
	; SSE-NEXT: [[TMP1:%.*]] = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 0), align 16			; SSE-NEXT: [[TMP1:%.*]] = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 0), align 16
	(12 lines not shown)
	; AVX-NEXT: [[TMP1:%.*]] = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 0), align 16			; AVX-NEXT: [[TMP1:%.*]] = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 0), align 16
	; AVX-NEXT: [[TMP2:%.*]] = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 2), align 8			; AVX-NEXT: [[TMP2:%.*]] = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 2), align 8
	; AVX-NEXT: [[TMP3:%.*]] = insertelement <8 x i32> poison, i32 [[TMP1]], i64 0			; AVX-NEXT: [[TMP3:%.*]] = insertelement <8 x i32> poison, i32 [[TMP1]], i64 0
	; AVX-NEXT: [[TMP4:%.*]] = insertelement <8 x i32> [[TMP3]], i32 [[TMP2]], i64 1			; AVX-NEXT: [[TMP4:%.*]] = insertelement <8 x i32> [[TMP3]], i32 [[TMP2]], i64 1
	; AVX-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x i32> [[TMP4]], <8 x i32> poison, <8 x i32> <i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1>			; AVX-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x i32> [[TMP4]], <8 x i32> poison, <8 x i32> <i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1>
	; AVX-NEXT: store <8 x i32> [[SHUFFLE]], <8 x i32>* bitcast ([8 x i32]* @a to <8 x i32>*), align 16			; AVX-NEXT: store <8 x i32> [[SHUFFLE]], <8 x i32>* bitcast ([8 x i32]* @a to <8 x i32>*), align 16
	; AVX-NEXT: ret void			; AVX-NEXT: ret void
	;			;
				; AVX512-LABEL: @foo(
				; AVX512-NEXT: [[TMP1:%.*]] = call <2 x i32> @llvm.masked.gather.v2i32.v2p0i32(<2 x i32*> <i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 0), i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 2)>, i32 8, <2 x i1> <i1 true, i1 true>, <2 x i32> undef)
				; AVX512-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1>
				; AVX512-NEXT: store <8 x i32> [[SHUFFLE]], <8 x i32>* bitcast ([8 x i32]* @a to <8 x i32>*), align 16
				; AVX512-NEXT: ret void
				;
	%1 = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 0), align 16			%1 = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 0), align 16
	store i32 %1, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 0), align 16			store i32 %1, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 0), align 16
	%2 = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 2), align 8			%2 = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 2), align 8
	store i32 %2, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 1), align 4			store i32 %2, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 1), align 4
	store i32 %1, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 2), align 8			store i32 %1, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 2), align 8
	store i32 %2, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 3), align 4			store i32 %2, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 3), align 4
	store i32 %1, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 4), align 16			store i32 %1, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 4), align 16
	store i32 %2, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 5), align 4			store i32 %2, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 5), align 4
	store i32 %1, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 6), align 8			store i32 %1, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 6), align 8
	store i32 %2, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 7), align 4			store i32 %2, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 7), align 4
	ret void			ret void
	}			}

llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll

	(99 lines not shown)
	; AVX2-NEXT: [[TMP15:%.*]] = add nsw <4 x i32> [[TMP14]], <i32 1, i32 2, i32 3, i32 4>			; AVX2-NEXT: [[TMP15:%.*]] = add nsw <4 x i32> [[TMP14]], <i32 1, i32 2, i32 3, i32 4>
	; AVX2-NEXT: [[TMP16:%.*]] = bitcast i32* [[TMP0:%.*]] to <4 x i32>*			; AVX2-NEXT: [[TMP16:%.*]] = bitcast i32* [[TMP0:%.*]] to <4 x i32>*
	; AVX2-NEXT: store <4 x i32> [[TMP15]], <4 x i32>* [[TMP16]], align 4, !tbaa [[TBAA0]]			; AVX2-NEXT: store <4 x i32> [[TMP15]], <4 x i32>* [[TMP16]], align 4, !tbaa [[TBAA0]]
	; AVX2-NEXT: ret void			; AVX2-NEXT: ret void
	;			;
	; AVX512F-LABEL: @gather_load_2(			; AVX512F-LABEL: @gather_load_2(
	; AVX512F-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, i32* [[TMP1:%.*]], i64 1			; AVX512F-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, i32* [[TMP1:%.*]], i64 1
	; AVX512F-NEXT: [[TMP4:%.*]] = load i32, i32* [[TMP3]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP4:%.*]] = load i32, i32* [[TMP3]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 10			; AVX512F-NEXT: [[TMP5:%.*]] = add nsw i32 [[TMP4]], 1
	; AVX512F-NEXT: [[TMP6:%.*]] = load i32, i32* [[TMP5]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, i32* [[TMP0:%.*]], i64 1
	; AVX512F-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 3			; AVX512F-NEXT: store i32 [[TMP5]], i32* [[TMP0]], align 4, !tbaa [[TBAA0]]
				; AVX512F-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 10
	; AVX512F-NEXT: [[TMP8:%.*]] = load i32, i32* [[TMP7]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP8:%.*]] = load i32, i32* [[TMP7]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 5			; AVX512F-NEXT: [[TMP9:%.*]] = add nsw i32 [[TMP8]], 2
	; AVX512F-NEXT: [[TMP10:%.*]] = load i32, i32* [[TMP9]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 2
	; AVX512F-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> poison, i32 [[TMP4]], i64 0			; AVX512F-NEXT: store i32 [[TMP9]], i32* [[TMP6]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP12:%.*]] = insertelement <4 x i32> [[TMP11]], i32 [[TMP6]], i64 1			; AVX512F-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 3
	; AVX512F-NEXT: [[TMP13:%.*]] = insertelement <4 x i32> [[TMP12]], i32 [[TMP8]], i64 2			; AVX512F-NEXT: [[TMP12:%.*]] = load i32, i32* [[TMP11]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP14:%.*]] = insertelement <4 x i32> [[TMP13]], i32 [[TMP10]], i64 3			; AVX512F-NEXT: [[TMP13:%.*]] = add nsw i32 [[TMP12]], 3
	; AVX512F-NEXT: [[TMP15:%.*]] = add nsw <4 x i32> [[TMP14]], <i32 1, i32 2, i32 3, i32 4>			; AVX512F-NEXT: [[TMP14:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 3
	; AVX512F-NEXT: [[TMP16:%.*]] = bitcast i32* [[TMP0:%.*]] to <4 x i32>*			; AVX512F-NEXT: store i32 [[TMP13]], i32* [[TMP10]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: store <4 x i32> [[TMP15]], <4 x i32>* [[TMP16]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP15:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 5
				; AVX512F-NEXT: [[TMP16:%.*]] = load i32, i32* [[TMP15]], align 4, !tbaa [[TBAA0]]
				; AVX512F-NEXT: [[TMP17:%.*]] = add nsw i32 [[TMP16]], 4
				; AVX512F-NEXT: store i32 [[TMP17]], i32* [[TMP14]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: ret void			; AVX512F-NEXT: ret void
	;			;
	; AVX512VL-LABEL: @gather_load_2(			; AVX512VL-LABEL: @gather_load_2(
	; AVX512VL-NEXT: [[TMP3:%.*]] = insertelement <4 x i32*> poison, i32* [[TMP1:%.*]], i64 0			; AVX512VL-NEXT: [[TMP3:%.*]] = insertelement <4 x i32*> poison, i32* [[TMP1:%.*]], i64 0
	; AVX512VL-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32*> [[TMP3]], <4 x i32*> poison, <4 x i32> zeroinitializer			; AVX512VL-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32*> [[TMP3]], <4 x i32*> poison, <4 x i32> zeroinitializer
	; AVX512VL-NEXT: [[TMP4:%.*]] = getelementptr i32, <4 x i32*> [[SHUFFLE]], <4 x i64> <i64 1, i64 10, i64 3, i64 5>			; AVX512VL-NEXT: [[TMP4:%.*]] = getelementptr i32, <4 x i32*> [[SHUFFLE]], <4 x i64> <i64 1, i64 10, i64 3, i64 5>
	; AVX512VL-NEXT: [[TMP5:%.*]] = call <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32*> [[TMP4]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef), !tbaa [[TBAA0]]			; AVX512VL-NEXT: [[TMP5:%.*]] = call <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32*> [[TMP4]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef), !tbaa [[TBAA0]]
	; AVX512VL-NEXT: [[TMP6:%.*]] = add nsw <4 x i32> [[TMP5]], <i32 1, i32 2, i32 3, i32 4>			; AVX512VL-NEXT: [[TMP6:%.*]] = add nsw <4 x i32> [[TMP5]], <i32 1, i32 2, i32 3, i32 4>
	(121 lines not shown)
	; AVX2-NEXT: [[TMP25:%.*]] = insertelement <8 x i32> [[TMP24]], i32 [[TMP17]], i64 7			; AVX2-NEXT: [[TMP25:%.*]] = insertelement <8 x i32> [[TMP24]], i32 [[TMP17]], i64 7
	; AVX2-NEXT: [[TMP26:%.*]] = add <8 x i32> [[TMP25]], <i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4>			; AVX2-NEXT: [[TMP26:%.*]] = add <8 x i32> [[TMP25]], <i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4>
	; AVX2-NEXT: [[TMP27:%.*]] = bitcast i32* [[TMP0:%.*]] to <8 x i32>*			; AVX2-NEXT: [[TMP27:%.*]] = bitcast i32* [[TMP0:%.*]] to <8 x i32>*
	; AVX2-NEXT: store <8 x i32> [[TMP26]], <8 x i32>* [[TMP27]], align 4, !tbaa [[TBAA0]]			; AVX2-NEXT: store <8 x i32> [[TMP26]], <8 x i32>* [[TMP27]], align 4, !tbaa [[TBAA0]]
	; AVX2-NEXT: ret void			; AVX2-NEXT: ret void
	;			;
	; AVX512F-LABEL: @gather_load_3(			; AVX512F-LABEL: @gather_load_3(
	; AVX512F-NEXT: [[TMP3:%.*]] = load i32, i32* [[TMP1:%.*]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP3:%.*]] = load i32, i32* [[TMP1:%.*]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP4:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 11			; AVX512F-NEXT: [[TMP4:%.*]] = add i32 [[TMP3]], 1
	; AVX512F-NEXT: [[TMP5:%.*]] = load i32, i32* [[TMP4]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[TMP0:%.*]], i64 1
	; AVX512F-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 4			; AVX512F-NEXT: store i32 [[TMP4]], i32* [[TMP0]], align 4, !tbaa [[TBAA0]]
				; AVX512F-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 11
	; AVX512F-NEXT: [[TMP7:%.*]] = load i32, i32* [[TMP6]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP7:%.*]] = load i32, i32* [[TMP6]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP8:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 15			; AVX512F-NEXT: [[TMP8:%.*]] = add i32 [[TMP7]], 2
	; AVX512F-NEXT: [[TMP9:%.*]] = load i32, i32* [[TMP8]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 2
	; AVX512F-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> poison, i32 [[TMP3]], i64 0			; AVX512F-NEXT: store i32 [[TMP8]], i32* [[TMP5]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 [[TMP5]], i64 1			; AVX512F-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 4
	; AVX512F-NEXT: [[TMP12:%.*]] = insertelement <4 x i32> [[TMP11]], i32 [[TMP7]], i64 2			; AVX512F-NEXT: [[TMP11:%.*]] = load i32, i32* [[TMP10]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP13:%.*]] = insertelement <4 x i32> [[TMP12]], i32 [[TMP9]], i64 3			; AVX512F-NEXT: [[TMP12:%.*]] = add i32 [[TMP11]], 3
	; AVX512F-NEXT: [[TMP14:%.*]] = add <4 x i32> [[TMP13]], <i32 1, i32 2, i32 3, i32 4>			; AVX512F-NEXT: [[TMP13:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 3
	; AVX512F-NEXT: [[TMP15:%.*]] = getelementptr inbounds i32, i32* [[TMP0:%.*]], i64 4			; AVX512F-NEXT: store i32 [[TMP12]], i32* [[TMP9]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP16:%.*]] = bitcast i32* [[TMP0]] to <4 x i32>*			; AVX512F-NEXT: [[TMP14:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 15
	; AVX512F-NEXT: store <4 x i32> [[TMP14]], <4 x i32>* [[TMP16]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP15:%.*]] = load i32, i32* [[TMP14]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP17:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 18			; AVX512F-NEXT: [[TMP16:%.*]] = add i32 [[TMP15]], 4
	; AVX512F-NEXT: [[TMP18:%.*]] = load i32, i32* [[TMP17]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP17:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 4
	; AVX512F-NEXT: [[TMP19:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 9			; AVX512F-NEXT: store i32 [[TMP16]], i32* [[TMP13]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP20:%.*]] = load i32, i32* [[TMP19]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP18:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 18
	; AVX512F-NEXT: [[TMP21:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 6			; AVX512F-NEXT: [[TMP19:%.*]] = load i32, i32* [[TMP18]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP22:%.*]] = load i32, i32* [[TMP21]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP20:%.*]] = add i32 [[TMP19]], 1
	; AVX512F-NEXT: [[TMP23:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 21			; AVX512F-NEXT: [[TMP21:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 5
	; AVX512F-NEXT: [[TMP24:%.*]] = load i32, i32* [[TMP23]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: store i32 [[TMP20]], i32* [[TMP17]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP25:%.*]] = insertelement <4 x i32> poison, i32 [[TMP18]], i64 0			; AVX512F-NEXT: [[TMP22:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 9
	; AVX512F-NEXT: [[TMP26:%.*]] = insertelement <4 x i32> [[TMP25]], i32 [[TMP20]], i64 1			; AVX512F-NEXT: [[TMP23:%.*]] = load i32, i32* [[TMP22]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP27:%.*]] = insertelement <4 x i32> [[TMP26]], i32 [[TMP22]], i64 2			; AVX512F-NEXT: [[TMP24:%.*]] = add i32 [[TMP23]], 2
	; AVX512F-NEXT: [[TMP28:%.*]] = insertelement <4 x i32> [[TMP27]], i32 [[TMP24]], i64 3			; AVX512F-NEXT: [[TMP25:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 6
	; AVX512F-NEXT: [[TMP29:%.*]] = add <4 x i32> [[TMP28]], <i32 1, i32 2, i32 3, i32 4>			; AVX512F-NEXT: store i32 [[TMP24]], i32* [[TMP21]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP30:%.*]] = bitcast i32* [[TMP15]] to <4 x i32>*			; AVX512F-NEXT: [[TMP26:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 6
	; AVX512F-NEXT: store <4 x i32> [[TMP29]], <4 x i32>* [[TMP30]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP27:%.*]] = load i32, i32* [[TMP26]], align 4, !tbaa [[TBAA0]]
				; AVX512F-NEXT: [[TMP28:%.*]] = add i32 [[TMP27]], 3
				; AVX512F-NEXT: [[TMP29:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 7
				; AVX512F-NEXT: store i32 [[TMP28]], i32* [[TMP25]], align 4, !tbaa [[TBAA0]]
				; AVX512F-NEXT: [[TMP30:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 21
				; AVX512F-NEXT: [[TMP31:%.*]] = load i32, i32* [[TMP30]], align 4, !tbaa [[TBAA0]]
				; AVX512F-NEXT: [[TMP32:%.*]] = add i32 [[TMP31]], 4
				; AVX512F-NEXT: store i32 [[TMP32]], i32* [[TMP29]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: ret void			; AVX512F-NEXT: ret void
	;			;
	; AVX512VL-LABEL: @gather_load_3(			; AVX512VL-LABEL: @gather_load_3(
	; AVX512VL-NEXT: [[TMP3:%.*]] = load i32, i32* [[TMP1:%.*]], align 4, !tbaa [[TBAA0]]			; AVX512VL-NEXT: [[TMP3:%.*]] = load i32, i32* [[TMP1:%.*]], align 4, !tbaa [[TBAA0]]
	; AVX512VL-NEXT: [[TMP4:%.*]] = add i32 [[TMP3]], 1			; AVX512VL-NEXT: [[TMP4:%.*]] = add i32 [[TMP3]], 1
	; AVX512VL-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[TMP0:%.*]], i64 1			; AVX512VL-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[TMP0:%.*]], i64 1
	; AVX512VL-NEXT: store i32 [[TMP4]], i32* [[TMP0]], align 4, !tbaa [[TBAA0]]			; AVX512VL-NEXT: store i32 [[TMP4]], i32* [[TMP0]], align 4, !tbaa [[TBAA0]]
	; AVX512VL-NEXT: [[TMP6:%.*]] = insertelement <4 x i32*> poison, i32* [[TMP1]], i64 0			; AVX512VL-NEXT: [[TMP6:%.*]] = insertelement <4 x i32*> poison, i32* [[TMP1]], i64 0
	(157 lines not shown)
	; AVX2-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[T27]], i64 6			; AVX2-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[T27]], i64 6
	; AVX2-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[T31]], i64 7			; AVX2-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[T31]], i64 7
	; AVX2-NEXT: [[TMP9:%.*]] = add <8 x i32> [[TMP8]], <i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4>			; AVX2-NEXT: [[TMP9:%.*]] = add <8 x i32> [[TMP8]], <i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4>
	; AVX2-NEXT: [[TMP10:%.*]] = bitcast i32* [[T0:%.*]] to <8 x i32>*			; AVX2-NEXT: [[TMP10:%.*]] = bitcast i32* [[T0:%.*]] to <8 x i32>*
	; AVX2-NEXT: store <8 x i32> [[TMP9]], <8 x i32>* [[TMP10]], align 4, !tbaa [[TBAA0]]			; AVX2-NEXT: store <8 x i32> [[TMP9]], <8 x i32>* [[TMP10]], align 4, !tbaa [[TBAA0]]
	; AVX2-NEXT: ret void			; AVX2-NEXT: ret void
	;			;
	; AVX512F-LABEL: @gather_load_4(			; AVX512F-LABEL: @gather_load_4(
				; AVX512F-NEXT: [[T5:%.*]] = getelementptr inbounds i32, i32* [[T0:%.*]], i64 1
	; AVX512F-NEXT: [[T6:%.*]] = getelementptr inbounds i32, i32* [[T1:%.*]], i64 11			; AVX512F-NEXT: [[T6:%.*]] = getelementptr inbounds i32, i32* [[T1:%.*]], i64 11
				; AVX512F-NEXT: [[T9:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 2
	; AVX512F-NEXT: [[T10:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 4			; AVX512F-NEXT: [[T10:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 4
				; AVX512F-NEXT: [[T13:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 3
	; AVX512F-NEXT: [[T14:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 15			; AVX512F-NEXT: [[T14:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 15
	; AVX512F-NEXT: [[T17:%.*]] = getelementptr inbounds i32, i32* [[T0:%.*]], i64 4			; AVX512F-NEXT: [[T17:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 4
	; AVX512F-NEXT: [[T18:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 18			; AVX512F-NEXT: [[T18:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 18
				; AVX512F-NEXT: [[T21:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 5
	; AVX512F-NEXT: [[T22:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 9			; AVX512F-NEXT: [[T22:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 9
				; AVX512F-NEXT: [[T25:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 6
	; AVX512F-NEXT: [[T26:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 6			; AVX512F-NEXT: [[T26:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 6
				; AVX512F-NEXT: [[T29:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 7
	; AVX512F-NEXT: [[T30:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 21			; AVX512F-NEXT: [[T30:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 21
	; AVX512F-NEXT: [[T3:%.*]] = load i32, i32* [[T1]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[T3:%.*]] = load i32, i32* [[T1]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[T7:%.*]] = load i32, i32* [[T6]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[T7:%.*]] = load i32, i32* [[T6]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[T11:%.*]] = load i32, i32* [[T10]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[T11:%.*]] = load i32, i32* [[T10]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[T15:%.*]] = load i32, i32* [[T14]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[T15:%.*]] = load i32, i32* [[T14]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[T19:%.*]] = load i32, i32* [[T18]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[T19:%.*]] = load i32, i32* [[T18]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[T23:%.*]] = load i32, i32* [[T22]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[T23:%.*]] = load i32, i32* [[T22]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[T27:%.*]] = load i32, i32* [[T26]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[T27:%.*]] = load i32, i32* [[T26]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[T31:%.*]] = load i32, i32* [[T30]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[T31:%.*]] = load i32, i32* [[T30]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP1:%.*]] = insertelement <4 x i32> poison, i32 [[T3]], i64 0			; AVX512F-NEXT: [[T4:%.*]] = add i32 [[T3]], 1
	; AVX512F-NEXT: [[TMP2:%.*]] = insertelement <4 x i32> [[TMP1]], i32 [[T7]], i64 1			; AVX512F-NEXT: [[T8:%.*]] = add i32 [[T7]], 2
	; AVX512F-NEXT: [[TMP3:%.*]] = insertelement <4 x i32> [[TMP2]], i32 [[T11]], i64 2			; AVX512F-NEXT: [[T12:%.*]] = add i32 [[T11]], 3
	; AVX512F-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> [[TMP3]], i32 [[T15]], i64 3			; AVX512F-NEXT: [[T16:%.*]] = add i32 [[T15]], 4
	; AVX512F-NEXT: [[TMP5:%.*]] = add <4 x i32> [[TMP4]], <i32 1, i32 2, i32 3, i32 4>			; AVX512F-NEXT: [[T20:%.*]] = add i32 [[T19]], 1
	; AVX512F-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> poison, i32 [[T19]], i64 0			; AVX512F-NEXT: [[T24:%.*]] = add i32 [[T23]], 2
	; AVX512F-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[T23]], i64 1			; AVX512F-NEXT: [[T28:%.*]] = add i32 [[T27]], 3
	; AVX512F-NEXT: [[TMP8:%.*]] = insertelement <4 x i32> [[TMP7]], i32 [[T27]], i64 2			; AVX512F-NEXT: [[T32:%.*]] = add i32 [[T31]], 4
	; AVX512F-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> [[TMP8]], i32 [[T31]], i64 3			; AVX512F-NEXT: store i32 [[T4]], i32* [[T0]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP10:%.*]] = add <4 x i32> [[TMP9]], <i32 1, i32 2, i32 3, i32 4>			; AVX512F-NEXT: store i32 [[T8]], i32* [[T5]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP11:%.*]] = bitcast i32* [[T0]] to <4 x i32>*			; AVX512F-NEXT: store i32 [[T12]], i32* [[T9]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP11]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: store i32 [[T16]], i32* [[T13]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP12:%.*]] = bitcast i32* [[T17]] to <4 x i32>*			; AVX512F-NEXT: store i32 [[T20]], i32* [[T17]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: store <4 x i32> [[TMP10]], <4 x i32>* [[TMP12]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: store i32 [[T24]], i32* [[T21]], align 4, !tbaa [[TBAA0]]
				; AVX512F-NEXT: store i32 [[T28]], i32* [[T25]], align 4, !tbaa [[TBAA0]]
				; AVX512F-NEXT: store i32 [[T32]], i32* [[T29]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: ret void			; AVX512F-NEXT: ret void
	;			;
	; AVX512VL-LABEL: @gather_load_4(			; AVX512VL-LABEL: @gather_load_4(
	; AVX512VL-NEXT: [[T5:%.*]] = getelementptr inbounds i32, i32* [[T0:%.*]], i64 1			; AVX512VL-NEXT: [[T5:%.*]] = getelementptr inbounds i32, i32* [[T0:%.*]], i64 1
	; AVX512VL-NEXT: [[TMP1:%.*]] = insertelement <4 x i32*> poison, i32* [[T1:%.*]], i64 0			; AVX512VL-NEXT: [[TMP1:%.*]] = insertelement <4 x i32*> poison, i32* [[T1:%.*]], i64 0
	; AVX512VL-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32*> [[TMP1]], <4 x i32*> poison, <4 x i32> zeroinitializer			; AVX512VL-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32*> [[TMP1]], <4 x i32*> poison, <4 x i32> zeroinitializer
	; AVX512VL-NEXT: [[TMP2:%.*]] = getelementptr i32, <4 x i32*> [[SHUFFLE]], <4 x i64> <i64 11, i64 4, i64 15, i64 18>			; AVX512VL-NEXT: [[TMP2:%.*]] = getelementptr i32, <4 x i32*> [[SHUFFLE]], <4 x i64> <i64 11, i64 4, i64 15, i64 18>
	; AVX512VL-NEXT: [[T21:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 5			; AVX512VL-NEXT: [[T21:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 5

llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll

	; AVX2-NEXT: [[TMP15:%.*]] = add nsw <4 x i32> [[TMP14]], <i32 1, i32 2, i32 3, i32 4>			; AVX2-NEXT: [[TMP15:%.*]] = add nsw <4 x i32> [[TMP14]], <i32 1, i32 2, i32 3, i32 4>
	; AVX2-NEXT: [[TMP16:%.*]] = bitcast i32* [[TMP0:%.*]] to <4 x i32>*			; AVX2-NEXT: [[TMP16:%.*]] = bitcast i32* [[TMP0:%.*]] to <4 x i32>*
	; AVX2-NEXT: store <4 x i32> [[TMP15]], <4 x i32>* [[TMP16]], align 4, !tbaa [[TBAA0]]			; AVX2-NEXT: store <4 x i32> [[TMP15]], <4 x i32>* [[TMP16]], align 4, !tbaa [[TBAA0]]
	; AVX2-NEXT: ret void			; AVX2-NEXT: ret void
	;			;
	; AVX512F-LABEL: @gather_load_2(			; AVX512F-LABEL: @gather_load_2(
	; AVX512F-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, i32* [[TMP1:%.*]], i64 1			; AVX512F-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, i32* [[TMP1:%.*]], i64 1
	; AVX512F-NEXT: [[TMP4:%.*]] = load i32, i32* [[TMP3]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP4:%.*]] = load i32, i32* [[TMP3]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 10			; AVX512F-NEXT: [[TMP5:%.*]] = add nsw i32 [[TMP4]], 1
	; AVX512F-NEXT: [[TMP6:%.*]] = load i32, i32* [[TMP5]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, i32* [[TMP0:%.*]], i64 1
	; AVX512F-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 3			; AVX512F-NEXT: store i32 [[TMP5]], i32* [[TMP0]], align 4, !tbaa [[TBAA0]]
				; AVX512F-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 10
	; AVX512F-NEXT: [[TMP8:%.*]] = load i32, i32* [[TMP7]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP8:%.*]] = load i32, i32* [[TMP7]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 5			; AVX512F-NEXT: [[TMP9:%.*]] = add nsw i32 [[TMP8]], 2
	; AVX512F-NEXT: [[TMP10:%.*]] = load i32, i32* [[TMP9]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 2
	; AVX512F-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> poison, i32 [[TMP4]], i64 0			; AVX512F-NEXT: store i32 [[TMP9]], i32* [[TMP6]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP12:%.*]] = insertelement <4 x i32> [[TMP11]], i32 [[TMP6]], i64 1			; AVX512F-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 3
	; AVX512F-NEXT: [[TMP13:%.*]] = insertelement <4 x i32> [[TMP12]], i32 [[TMP8]], i64 2			; AVX512F-NEXT: [[TMP12:%.*]] = load i32, i32* [[TMP11]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP14:%.*]] = insertelement <4 x i32> [[TMP13]], i32 [[TMP10]], i64 3			; AVX512F-NEXT: [[TMP13:%.*]] = add nsw i32 [[TMP12]], 3
	; AVX512F-NEXT: [[TMP15:%.*]] = add nsw <4 x i32> [[TMP14]], <i32 1, i32 2, i32 3, i32 4>			; AVX512F-NEXT: [[TMP14:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 3
	; AVX512F-NEXT: [[TMP16:%.*]] = bitcast i32* [[TMP0:%.*]] to <4 x i32>*			; AVX512F-NEXT: store i32 [[TMP13]], i32* [[TMP10]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: store <4 x i32> [[TMP15]], <4 x i32>* [[TMP16]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP15:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 5
				; AVX512F-NEXT: [[TMP16:%.*]] = load i32, i32* [[TMP15]], align 4, !tbaa [[TBAA0]]
				; AVX512F-NEXT: [[TMP17:%.*]] = add nsw i32 [[TMP16]], 4
				; AVX512F-NEXT: store i32 [[TMP17]], i32* [[TMP14]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: ret void			; AVX512F-NEXT: ret void
	;			;
	; AVX512VL-LABEL: @gather_load_2(			; AVX512VL-LABEL: @gather_load_2(
	; AVX512VL-NEXT: [[TMP3:%.*]] = insertelement <4 x i32*> poison, i32* [[TMP1:%.*]], i64 0			; AVX512VL-NEXT: [[TMP3:%.*]] = insertelement <4 x i32*> poison, i32* [[TMP1:%.*]], i64 0
	; AVX512VL-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32*> [[TMP3]], <4 x i32*> poison, <4 x i32> zeroinitializer			; AVX512VL-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32*> [[TMP3]], <4 x i32*> poison, <4 x i32> zeroinitializer
	; AVX512VL-NEXT: [[TMP4:%.*]] = getelementptr i32, <4 x i32*> [[SHUFFLE]], <4 x i64> <i64 1, i64 10, i64 3, i64 5>			; AVX512VL-NEXT: [[TMP4:%.*]] = getelementptr i32, <4 x i32*> [[SHUFFLE]], <4 x i64> <i64 1, i64 10, i64 3, i64 5>
	; AVX512VL-NEXT: [[TMP5:%.*]] = call <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32*> [[TMP4]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef), !tbaa [[TBAA0]]			; AVX512VL-NEXT: [[TMP5:%.*]] = call <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32*> [[TMP4]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef), !tbaa [[TBAA0]]
	; AVX512VL-NEXT: [[TMP6:%.*]] = add nsw <4 x i32> [[TMP5]], <i32 1, i32 2, i32 3, i32 4>			; AVX512VL-NEXT: [[TMP6:%.*]] = add nsw <4 x i32> [[TMP5]], <i32 1, i32 2, i32 3, i32 4>
	[...]
	; AVX2-NEXT: [[TMP25:%.*]] = insertelement <8 x i32> [[TMP24]], i32 [[TMP17]], i64 7			; AVX2-NEXT: [[TMP25:%.*]] = insertelement <8 x i32> [[TMP24]], i32 [[TMP17]], i64 7
	; AVX2-NEXT: [[TMP26:%.*]] = add <8 x i32> [[TMP25]], <i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4>			; AVX2-NEXT: [[TMP26:%.*]] = add <8 x i32> [[TMP25]], <i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4>
	; AVX2-NEXT: [[TMP27:%.*]] = bitcast i32* [[TMP0:%.*]] to <8 x i32>*			; AVX2-NEXT: [[TMP27:%.*]] = bitcast i32* [[TMP0:%.*]] to <8 x i32>*
	; AVX2-NEXT: store <8 x i32> [[TMP26]], <8 x i32>* [[TMP27]], align 4, !tbaa [[TBAA0]]			; AVX2-NEXT: store <8 x i32> [[TMP26]], <8 x i32>* [[TMP27]], align 4, !tbaa [[TBAA0]]
	; AVX2-NEXT: ret void			; AVX2-NEXT: ret void
	;			;
	; AVX512F-LABEL: @gather_load_3(			; AVX512F-LABEL: @gather_load_3(
	; AVX512F-NEXT: [[TMP3:%.*]] = load i32, i32* [[TMP1:%.*]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP3:%.*]] = load i32, i32* [[TMP1:%.*]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP4:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 11			; AVX512F-NEXT: [[TMP4:%.*]] = add i32 [[TMP3]], 1
	; AVX512F-NEXT: [[TMP5:%.*]] = load i32, i32* [[TMP4]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[TMP0:%.*]], i64 1
	; AVX512F-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 4			; AVX512F-NEXT: store i32 [[TMP4]], i32* [[TMP0]], align 4, !tbaa [[TBAA0]]
				; AVX512F-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 11
	; AVX512F-NEXT: [[TMP7:%.*]] = load i32, i32* [[TMP6]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP7:%.*]] = load i32, i32* [[TMP6]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP8:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 15			; AVX512F-NEXT: [[TMP8:%.*]] = add i32 [[TMP7]], 2
	; AVX512F-NEXT: [[TMP9:%.*]] = load i32, i32* [[TMP8]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 2
	; AVX512F-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> poison, i32 [[TMP3]], i64 0			; AVX512F-NEXT: store i32 [[TMP8]], i32* [[TMP5]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 [[TMP5]], i64 1			; AVX512F-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 4
	; AVX512F-NEXT: [[TMP12:%.*]] = insertelement <4 x i32> [[TMP11]], i32 [[TMP7]], i64 2			; AVX512F-NEXT: [[TMP11:%.*]] = load i32, i32* [[TMP10]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP13:%.*]] = insertelement <4 x i32> [[TMP12]], i32 [[TMP9]], i64 3			; AVX512F-NEXT: [[TMP12:%.*]] = add i32 [[TMP11]], 3
	; AVX512F-NEXT: [[TMP14:%.*]] = add <4 x i32> [[TMP13]], <i32 1, i32 2, i32 3, i32 4>			; AVX512F-NEXT: [[TMP13:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 3
	; AVX512F-NEXT: [[TMP15:%.*]] = getelementptr inbounds i32, i32* [[TMP0:%.*]], i64 4			; AVX512F-NEXT: store i32 [[TMP12]], i32* [[TMP9]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP16:%.*]] = bitcast i32* [[TMP0]] to <4 x i32>*			; AVX512F-NEXT: [[TMP14:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 15
	; AVX512F-NEXT: store <4 x i32> [[TMP14]], <4 x i32>* [[TMP16]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP15:%.*]] = load i32, i32* [[TMP14]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP17:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 18			; AVX512F-NEXT: [[TMP16:%.*]] = add i32 [[TMP15]], 4
	; AVX512F-NEXT: [[TMP18:%.*]] = load i32, i32* [[TMP17]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP17:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 4
	; AVX512F-NEXT: [[TMP19:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 9			; AVX512F-NEXT: store i32 [[TMP16]], i32* [[TMP13]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP20:%.*]] = load i32, i32* [[TMP19]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP18:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 18
	; AVX512F-NEXT: [[TMP21:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 6			; AVX512F-NEXT: [[TMP19:%.*]] = load i32, i32* [[TMP18]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP22:%.*]] = load i32, i32* [[TMP21]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP20:%.*]] = add i32 [[TMP19]], 1
	; AVX512F-NEXT: [[TMP23:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 21			; AVX512F-NEXT: [[TMP21:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 5
	; AVX512F-NEXT: [[TMP24:%.*]] = load i32, i32* [[TMP23]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: store i32 [[TMP20]], i32* [[TMP17]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP25:%.*]] = insertelement <4 x i32> poison, i32 [[TMP18]], i64 0			; AVX512F-NEXT: [[TMP22:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 9
	; AVX512F-NEXT: [[TMP26:%.*]] = insertelement <4 x i32> [[TMP25]], i32 [[TMP20]], i64 1			; AVX512F-NEXT: [[TMP23:%.*]] = load i32, i32* [[TMP22]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP27:%.*]] = insertelement <4 x i32> [[TMP26]], i32 [[TMP22]], i64 2			; AVX512F-NEXT: [[TMP24:%.*]] = add i32 [[TMP23]], 2
	; AVX512F-NEXT: [[TMP28:%.*]] = insertelement <4 x i32> [[TMP27]], i32 [[TMP24]], i64 3			; AVX512F-NEXT: [[TMP25:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 6
	; AVX512F-NEXT: [[TMP29:%.*]] = add <4 x i32> [[TMP28]], <i32 1, i32 2, i32 3, i32 4>			; AVX512F-NEXT: store i32 [[TMP24]], i32* [[TMP21]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP30:%.*]] = bitcast i32* [[TMP15]] to <4 x i32>*			; AVX512F-NEXT: [[TMP26:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 6
	; AVX512F-NEXT: store <4 x i32> [[TMP29]], <4 x i32>* [[TMP30]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[TMP27:%.*]] = load i32, i32* [[TMP26]], align 4, !tbaa [[TBAA0]]
				; AVX512F-NEXT: [[TMP28:%.*]] = add i32 [[TMP27]], 3
				; AVX512F-NEXT: [[TMP29:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 7
				; AVX512F-NEXT: store i32 [[TMP28]], i32* [[TMP25]], align 4, !tbaa [[TBAA0]]
				; AVX512F-NEXT: [[TMP30:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 21
				; AVX512F-NEXT: [[TMP31:%.*]] = load i32, i32* [[TMP30]], align 4, !tbaa [[TBAA0]]
				; AVX512F-NEXT: [[TMP32:%.*]] = add i32 [[TMP31]], 4
				; AVX512F-NEXT: store i32 [[TMP32]], i32* [[TMP29]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: ret void			; AVX512F-NEXT: ret void
	;			;
	; AVX512VL-LABEL: @gather_load_3(			; AVX512VL-LABEL: @gather_load_3(
	; AVX512VL-NEXT: [[TMP3:%.*]] = load i32, i32* [[TMP1:%.*]], align 4, !tbaa [[TBAA0]]			; AVX512VL-NEXT: [[TMP3:%.*]] = load i32, i32* [[TMP1:%.*]], align 4, !tbaa [[TBAA0]]
	; AVX512VL-NEXT: [[TMP4:%.*]] = add i32 [[TMP3]], 1			; AVX512VL-NEXT: [[TMP4:%.*]] = add i32 [[TMP3]], 1
	; AVX512VL-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[TMP0:%.*]], i64 1			; AVX512VL-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[TMP0:%.*]], i64 1
	; AVX512VL-NEXT: store i32 [[TMP4]], i32* [[TMP0]], align 4, !tbaa [[TBAA0]]			; AVX512VL-NEXT: store i32 [[TMP4]], i32* [[TMP0]], align 4, !tbaa [[TBAA0]]
	; AVX512VL-NEXT: [[TMP6:%.*]] = insertelement <4 x i32*> poison, i32* [[TMP1]], i64 0			; AVX512VL-NEXT: [[TMP6:%.*]] = insertelement <4 x i32*> poison, i32* [[TMP1]], i64 0
	[...]
	; AVX2-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[T27]], i64 6			; AVX2-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[T27]], i64 6
	; AVX2-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[T31]], i64 7			; AVX2-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[T31]], i64 7
	; AVX2-NEXT: [[TMP9:%.*]] = add <8 x i32> [[TMP8]], <i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4>			; AVX2-NEXT: [[TMP9:%.*]] = add <8 x i32> [[TMP8]], <i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4>
	; AVX2-NEXT: [[TMP10:%.*]] = bitcast i32* [[T0:%.*]] to <8 x i32>*			; AVX2-NEXT: [[TMP10:%.*]] = bitcast i32* [[T0:%.*]] to <8 x i32>*
	; AVX2-NEXT: store <8 x i32> [[TMP9]], <8 x i32>* [[TMP10]], align 4, !tbaa [[TBAA0]]			; AVX2-NEXT: store <8 x i32> [[TMP9]], <8 x i32>* [[TMP10]], align 4, !tbaa [[TBAA0]]
	; AVX2-NEXT: ret void			; AVX2-NEXT: ret void
	;			;
	; AVX512F-LABEL: @gather_load_4(			; AVX512F-LABEL: @gather_load_4(
				; AVX512F-NEXT: [[T5:%.*]] = getelementptr inbounds i32, i32* [[T0:%.*]], i64 1
	; AVX512F-NEXT: [[T6:%.*]] = getelementptr inbounds i32, i32* [[T1:%.*]], i64 11			; AVX512F-NEXT: [[T6:%.*]] = getelementptr inbounds i32, i32* [[T1:%.*]], i64 11
				; AVX512F-NEXT: [[T9:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 2
	; AVX512F-NEXT: [[T10:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 4			; AVX512F-NEXT: [[T10:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 4
				; AVX512F-NEXT: [[T13:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 3
	; AVX512F-NEXT: [[T14:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 15			; AVX512F-NEXT: [[T14:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 15
	; AVX512F-NEXT: [[T17:%.*]] = getelementptr inbounds i32, i32* [[T0:%.*]], i64 4			; AVX512F-NEXT: [[T17:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 4
	; AVX512F-NEXT: [[T18:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 18			; AVX512F-NEXT: [[T18:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 18
				; AVX512F-NEXT: [[T21:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 5
	; AVX512F-NEXT: [[T22:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 9			; AVX512F-NEXT: [[T22:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 9
				; AVX512F-NEXT: [[T25:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 6
	; AVX512F-NEXT: [[T26:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 6			; AVX512F-NEXT: [[T26:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 6
				; AVX512F-NEXT: [[T29:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 7
	; AVX512F-NEXT: [[T30:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 21			; AVX512F-NEXT: [[T30:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 21
	; AVX512F-NEXT: [[T3:%.*]] = load i32, i32* [[T1]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[T3:%.*]] = load i32, i32* [[T1]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[T7:%.*]] = load i32, i32* [[T6]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[T7:%.*]] = load i32, i32* [[T6]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[T11:%.*]] = load i32, i32* [[T10]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[T11:%.*]] = load i32, i32* [[T10]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[T15:%.*]] = load i32, i32* [[T14]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[T15:%.*]] = load i32, i32* [[T14]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[T19:%.*]] = load i32, i32* [[T18]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[T19:%.*]] = load i32, i32* [[T18]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[T23:%.*]] = load i32, i32* [[T22]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[T23:%.*]] = load i32, i32* [[T22]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[T27:%.*]] = load i32, i32* [[T26]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[T27:%.*]] = load i32, i32* [[T26]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[T31:%.*]] = load i32, i32* [[T30]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: [[T31:%.*]] = load i32, i32* [[T30]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP1:%.*]] = insertelement <4 x i32> poison, i32 [[T3]], i64 0			; AVX512F-NEXT: [[T4:%.*]] = add i32 [[T3]], 1
	; AVX512F-NEXT: [[TMP2:%.*]] = insertelement <4 x i32> [[TMP1]], i32 [[T7]], i64 1			; AVX512F-NEXT: [[T8:%.*]] = add i32 [[T7]], 2
	; AVX512F-NEXT: [[TMP3:%.*]] = insertelement <4 x i32> [[TMP2]], i32 [[T11]], i64 2			; AVX512F-NEXT: [[T12:%.*]] = add i32 [[T11]], 3
	; AVX512F-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> [[TMP3]], i32 [[T15]], i64 3			; AVX512F-NEXT: [[T16:%.*]] = add i32 [[T15]], 4
	; AVX512F-NEXT: [[TMP5:%.*]] = add <4 x i32> [[TMP4]], <i32 1, i32 2, i32 3, i32 4>			; AVX512F-NEXT: [[T20:%.*]] = add i32 [[T19]], 1
	; AVX512F-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> poison, i32 [[T19]], i64 0			; AVX512F-NEXT: [[T24:%.*]] = add i32 [[T23]], 2
	; AVX512F-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[T23]], i64 1			; AVX512F-NEXT: [[T28:%.*]] = add i32 [[T27]], 3
	; AVX512F-NEXT: [[TMP8:%.*]] = insertelement <4 x i32> [[TMP7]], i32 [[T27]], i64 2			; AVX512F-NEXT: [[T32:%.*]] = add i32 [[T31]], 4
	; AVX512F-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> [[TMP8]], i32 [[T31]], i64 3			; AVX512F-NEXT: store i32 [[T4]], i32* [[T0]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP10:%.*]] = add <4 x i32> [[TMP9]], <i32 1, i32 2, i32 3, i32 4>			; AVX512F-NEXT: store i32 [[T8]], i32* [[T5]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP11:%.*]] = bitcast i32* [[T0]] to <4 x i32>*			; AVX512F-NEXT: store i32 [[T12]], i32* [[T9]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP11]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: store i32 [[T16]], i32* [[T13]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: [[TMP12:%.*]] = bitcast i32* [[T17]] to <4 x i32>*			; AVX512F-NEXT: store i32 [[T20]], i32* [[T17]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: store <4 x i32> [[TMP10]], <4 x i32>* [[TMP12]], align 4, !tbaa [[TBAA0]]			; AVX512F-NEXT: store i32 [[T24]], i32* [[T21]], align 4, !tbaa [[TBAA0]]
				; AVX512F-NEXT: store i32 [[T28]], i32* [[T25]], align 4, !tbaa [[TBAA0]]
				; AVX512F-NEXT: store i32 [[T32]], i32* [[T29]], align 4, !tbaa [[TBAA0]]
	; AVX512F-NEXT: ret void			; AVX512F-NEXT: ret void
	;			;
	; AVX512VL-LABEL: @gather_load_4(			; AVX512VL-LABEL: @gather_load_4(
	; AVX512VL-NEXT: [[T5:%.*]] = getelementptr inbounds i32, i32* [[T0:%.*]], i64 1			; AVX512VL-NEXT: [[T5:%.*]] = getelementptr inbounds i32, i32* [[T0:%.*]], i64 1
	; AVX512VL-NEXT: [[TMP1:%.*]] = insertelement <4 x i32*> poison, i32* [[T1:%.*]], i64 0			; AVX512VL-NEXT: [[TMP1:%.*]] = insertelement <4 x i32*> poison, i32* [[T1:%.*]], i64 0
	; AVX512VL-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32*> [[TMP1]], <4 x i32*> poison, <4 x i32> zeroinitializer			; AVX512VL-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32*> [[TMP1]], <4 x i32*> poison, <4 x i32> zeroinitializer
	; AVX512VL-NEXT: [[TMP2:%.*]] = getelementptr i32, <4 x i32*> [[SHUFFLE]], <4 x i64> <i64 11, i64 4, i64 15, i64 18>			; AVX512VL-NEXT: [[TMP2:%.*]] = getelementptr i32, <4 x i32*> [[SHUFFLE]], <4 x i64> <i64 11, i64 4, i64 15, i64 18>
	; AVX512VL-NEXT: [[T21:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 5			; AVX512VL-NEXT: [[T21:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 5