This is an archive of the discontinued LLVM Phabricator instance.


Differential D92701

[SLPVectorize] Call isLegalMaskedGather before creating a gather TreeEntry
Needs Review · Public

Authored by Carrot on Dec 4 2020, 3:27 PM.

Details

Reviewers

anton-afanasyev
fhahn
RKSimon

Summary

Function isLegalMaskedGather should be queried when vectorizing a gather/scatter intrinsic/instruction, as LoopVectorize does in setCostBasedWideningDecision. SLPVectorize does not perform this check, so it can generate gather intrinsics even on targets that do not support gather instructions, such as x86 processors that support SSE but not AVX512.
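
To make the proposed check concrete, here is a minimal standalone sketch of the legality gate. It does not use the real LLVM classes: MockTarget, MemKind, EntryState, GatherScalars and classifyNonConsecutiveLoads are hypothetical names made up for illustration, and only the shape of isLegalGatherOrScatter and the ScatterVectorize state follow the patch.

```cpp
#include <cassert>

// Hypothetical stand-in for the target hooks; NOT the real TargetTransformInfo.
struct MockTarget {
  bool HasGather;  // target can lower a masked gather natively
  bool HasScatter; // target can lower a masked scatter natively
  bool isLegalMaskedGather() const { return HasGather; }
  bool isLegalMaskedScatter() const { return HasScatter; }
};

// Simplified instruction kind (illustrative only).
enum class MemKind { Load, Store, Other };

// Same shape as the helper in the patch: loads ask about gathers,
// stores ask about scatters, anything else is never legal.
bool isLegalGatherOrScatter(MemKind Kind, const MockTarget &T) {
  if (Kind == MemKind::Load)
    return T.isLegalMaskedGather();
  if (Kind == MemKind::Store)
    return T.isLegalMaskedScatter();
  return false;
}

// Hypothetical tree-entry states; only ScatterVectorize appears in the diff.
enum class EntryState { ScatterVectorize, GatherScalars };

// Decide how a bundle of non-consecutive loads would be recorded.
EntryState classifyNonConsecutiveLoads(const MockTarget &T) {
  return isLegalGatherOrScatter(MemKind::Load, T)
             ? EntryState::ScatterVectorize
             : EntryState::GatherScalars;
}
```

With this gate, a target modeled as `MockTarget{false, false}` never gets a ScatterVectorize entry, regardless of what the cost model later reports.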

Diff Detail

Unit Tests: Failed

	Time	Test
	380 ms	x64 debian > AddressSanitizer-x86_64-linux-dynamic.TestCases/Linux::odr-vtable.cpp
	380 ms	x64 debian > AddressSanitizer-x86_64-linux.TestCases/Linux::odr-vtable.cpp

Event Timeline

Carrot created this revision. Dec 4 2020, 3:27 PM

Herald added subscribers: pengfei, hiraditya. Dec 4 2020, 3:27 PM

Carrot requested review of this revision. Dec 4 2020, 3:27 PM

Herald added a project: Restricted Project. Dec 4 2020, 3:27 PM

Herald added a subscriber: llvm-commits.

Carrot added a reviewer: anton-afanasyev. Dec 4 2020, 3:31 PM

Carrot added a reviewer: fhahn.

Harbormaster completed remote builds in B81184: Diff 309677. Dec 4 2020, 5:56 PM

I think using isLegalMaskedGather is a workaround for cost-model deficiencies. If the cost model accurately estimated the cost of masked gathers and scatters, the SLP vectorizer would not select them. I think it would be preferable to improve the cost modeling for gathers/scatters on X86 and remove the workaround in LoopVectorize as well. I recently added an initial estimate for gather and scatter costs to BasicTTI (D91984), and this resolved similar issues on AArch64. Would it be possible to adjust X86TTIImpl::getGatherScatterOpCost to estimate the cost more accurately, maybe by falling back to BasicTTI in cases not handled there yet?
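
The cost-model argument above can be sketched in standalone C++. This is only an illustration under stated assumptions: MockCosts and every number in it are hypothetical, and none of this is the actual BasicTTI or X86TTIImpl logic. The point is that if a gather on a target without native support is priced as the scalarized sequence it will become, plus the cost of extracting each address, a plain cost comparison already rejects it.

```cpp
#include <cassert>

// Hypothetical per-target cost table; all values are made up for illustration.
struct MockCosts {
  unsigned ScalarLoadCost;   // one scalar load
  unsigned InsertCost;       // insert one scalar into a vector
  unsigned ExtractCost;      // extract one address from a pointer vector
  unsigned NativeGatherCost; // one hardware gather (whole vector)
  bool HasNativeGather;
};

// Cost of building a VF-wide vector from scalar loads.
unsigned scalarizedCost(const MockCosts &C, unsigned VF) {
  return VF * (C.ScalarLoadCost + C.InsertCost);
}

// Gather cost as an accurate model might report it: without native support
// the intrinsic is scalarized later, so it cannot be cheaper than the
// scalar sequence and also pays for extracting each address.
unsigned gatherCost(const MockCosts &C, unsigned VF) {
  if (C.HasNativeGather)
    return C.NativeGatherCost;
  return scalarizedCost(C, VF) + VF * C.ExtractCost;
}

// The vectorizer picks the gather only when it is strictly cheaper.
bool preferGather(const MockCosts &C, unsigned VF) {
  return gatherCost(C, VF) < scalarizedCost(C, VF);
}
```

Under this model no separate legality query is needed for the decision itself: the unsupported target loses the comparison on cost alone.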

https://bugs.llvm.org/show_bug.cgi?id=48044 - related PR about (not ideal?) cost model for gathers.

xbolva00 added a reviewer: RKSimon. Dec 5 2020, 2:47 AM

I agree with @fhahn: tuning the cost model is the better approach. There may also be architectures where gathers are missing but it is still beneficial to use them when building the vectorization tree. They are lowered to scalarized instructions later.

Yes, improving the cost model makes more sense (both for gather/scatter and GEP vectorization costs). SLP should only ever create a gather with a constant mask, so at the very least ScalarizeMaskedMemIntrin should do a good job of converting the loads to a BUILD_VECTOR sequence.

If we can get examples of bad vectorization using gathers I can take a look at working out exactly where the cost model is falling down.
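
As a rough illustration of the constant-mask point above, the following standalone sketch models a masked gather and its scalarized form on plain arrays. It is not the ScalarizeMaskedMemIntrin pass and does not touch LLVM IR. maskedGather and scalarizedAllTrueGather are hypothetical helper names, and the zero pass-through stands in for undef.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Reference semantics of a masked gather: lane I is loaded from Ptrs[I]
// when Mask[I] is set, otherwise it keeps PassThru[I].
template <std::size_t N>
std::array<int, N> maskedGather(const std::array<const int *, N> &Ptrs,
                                const std::array<bool, N> &Mask,
                                const std::array<int, N> &PassThru) {
  std::array<int, N> Result = PassThru;
  for (std::size_t I = 0; I < N; ++I)
    if (Mask[I])
      Result[I] = *Ptrs[I];
  return Result;
}

// With a constant all-true mask no lane needs a branch: the gather reduces
// to one scalar load per lane followed by an insert into the result vector,
// which is the "build vector" sequence.
template <std::size_t N>
std::array<int, N>
scalarizedAllTrueGather(const std::array<const int *, N> &Ptrs) {
  std::array<int, N> Result{};
  for (std::size_t I = 0; I < N; ++I)
    Result[I] = *Ptrs[I]; // scalar load + insertelement
  return Result;
}
```

For example, gathering from offsets 1, 10, 3 and 5 of a base array (the indices used by gather_load_2 in pr47629.ll) gives the same four values through either function when the mask is all true.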

RKSimon mentioned this in rGdb900995ed15: [CostModel][X86] getGatherScatterOpCost - use default implementation for alt… Dec 6 2020, 6:08 AM

anton-afanasyev mentioned this in D57779: [SLP] Add support for throttling. Dec 7 2020, 8:43 AM

@fhahn, thank you for the clarification of the isLegalMaskedGather and getGatherScatterOpCost usage.

@RKSimon, thank you for your help. I do have another test case that has bad vectorization with scatters, https://bugs.llvm.org/show_bug.cgi?id=48429.

Revision Contents

Path	Size
llvm/include/llvm/Analysis/VectorUtils.h	13 lines
llvm/lib/Analysis/VectorUtils.cpp	13 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	26 lines
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	17 lines
llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll	2 lines
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll	4 lines
llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll	361 lines

Diff 309677

llvm/include/llvm/Analysis/VectorUtils.h

	Show First 20 Lines • Show All 552 Lines • ▼ Show 20 Lines
	/// predicate mask are known to be true or undef. That is, return true if all			/// predicate mask are known to be true or undef. That is, return true if all
	/// lanes can be assumed active.			/// lanes can be assumed active.
	bool maskIsAllOneOrUndef(Value *Mask);			bool maskIsAllOneOrUndef(Value *Mask);

	/// Given a mask vector of the form <Y x i1>, return an APInt (of bitwidth Y)			/// Given a mask vector of the form <Y x i1>, return an APInt (of bitwidth Y)
	/// for each lane which may be active.			/// for each lane which may be active.
	APInt possiblyDemandedEltsInMask(Value *Mask);			APInt possiblyDemandedEltsInMask(Value *Mask);

				/// A helper function that returns the type of loaded or stored value.
				inline Type *getMemInstValueType(Value *I) {
				assert((isa<LoadInst>(I) || isa<StoreInst>(I)) &&
				"Expected Load or Store instruction");
				if (auto *LI = dyn_cast<LoadInst>(I))
				return LI->getType();
				return cast<StoreInst>(I)->getValueOperand()->getType();
				}

				/// Returns true if the target machine can represent \p V as a masked gather
				/// or scatter operation.
				bool isLegalGatherOrScatter(Value *V, const TargetTransformInfo *TTI);

	/// The group of interleaved loads/stores sharing the same stride and			/// The group of interleaved loads/stores sharing the same stride and
	/// close to each other.			/// close to each other.
	///			///
	/// Each member in this group has an index starting from 0, and the largest			/// Each member in this group has an index starting from 0, and the largest
	/// index should be less than interleaved factor, which is equal to the absolute			/// index should be less than interleaved factor, which is equal to the absolute
	/// value of the access's stride.			/// value of the access's stride.
	///			///
	/// E.g. An interleaved load group of factor 4:			/// E.g. An interleaved load group of factor 4:
	▲ Show 20 Lines • Show All 386 Lines • Show Last 20 Lines

llvm/lib/Analysis/VectorUtils.cpp

Show First 20 Lines • Show All 1,366 Lines • ▼ Show 20 Lines	case VFParamKind::GlobalPredicate:
for (unsigned NextPos = Pos + 1; NextPos < NumParams; ++NextPos)		for (unsigned NextPos = Pos + 1; NextPos < NumParams; ++NextPos)
if (Parameters[NextPos].ParamKind == VFParamKind::GlobalPredicate)		if (Parameters[NextPos].ParamKind == VFParamKind::GlobalPredicate)
return false;		return false;
break;		break;
}		}
}		}
return true;		return true;
}		}

		/// Returns true if the target machine can represent \p V as a masked gather
		/// or scatter operation.
		bool llvm::isLegalGatherOrScatter(Value *V, const TargetTransformInfo *TTI) {
		bool LI = isa<LoadInst>(V);
		bool SI = isa<StoreInst>(V);
		if (!LI && !SI)
		return false;
		auto *Ty = getMemInstValueType(V);
		Align Align = getLoadStoreAlignment(V);
		return (LI && TTI->isLegalMaskedGather(Ty, Align)) ||
		(SI && TTI->isLegalMaskedScatter(Ty, Align));
		}

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Show First 20 Lines • Show All 329 Lines • ▼ Show 20 Lines

cl::opt<bool> llvm::EnableLoopInterleaving(		cl::opt<bool> llvm::EnableLoopInterleaving(
"interleave-loops", cl::init(true), cl::Hidden,		"interleave-loops", cl::init(true), cl::Hidden,
cl::desc("Enable loop interleaving in Loop vectorization passes"));		cl::desc("Enable loop interleaving in Loop vectorization passes"));
cl::opt<bool> llvm::EnableLoopVectorization(		cl::opt<bool> llvm::EnableLoopVectorization(
"vectorize-loops", cl::init(true), cl::Hidden,		"vectorize-loops", cl::init(true), cl::Hidden,
cl::desc("Run the Loop vectorization passes"));		cl::desc("Run the Loop vectorization passes"));

/// A helper function that returns the type of loaded or stored value.
static Type *getMemInstValueType(Value *I) {
assert((isa<LoadInst>(I) || isa<StoreInst>(I)) &&
"Expected Load or Store instruction");
if (auto *LI = dyn_cast<LoadInst>(I))
return LI->getType();
return cast<StoreInst>(I)->getValueOperand()->getType();
}

/// A helper function that returns true if the given type is irregular. The		/// A helper function that returns true if the given type is irregular. The
/// type is irregular if its allocated size doesn't equal the store size of an		/// type is irregular if its allocated size doesn't equal the store size of an
/// element of the corresponding vector type at the given vectorization factor.		/// element of the corresponding vector type at the given vectorization factor.
static bool hasIrregularType(Type *Ty, const DataLayout &DL, ElementCount VF) {		static bool hasIrregularType(Type *Ty, const DataLayout &DL, ElementCount VF) {
assert(!VF.isScalable() && "scalable vectors not yet supported.");		assert(!VF.isScalable() && "scalable vectors not yet supported.");
// Determine if an array of VF elements of type Ty is "bitcast compatible"		// Determine if an array of VF elements of type Ty is "bitcast compatible"
// with a <VF x Ty> vector.		// with a <VF x Ty> vector.
if (VF.isVector()) {		if (VF.isVector()) {
▲ Show 20 Lines • Show All 963 Lines • ▼ Show 20 Lines	public:
}		}

/// Returns true if the target machine supports masked gather operation		/// Returns true if the target machine supports masked gather operation
/// for the given \p DataType.		/// for the given \p DataType.
bool isLegalMaskedGather(Type *DataType, Align Alignment) {		bool isLegalMaskedGather(Type *DataType, Align Alignment) {
return TTI.isLegalMaskedGather(DataType, Alignment);		return TTI.isLegalMaskedGather(DataType, Alignment);
}		}

/// Returns true if the target machine can represent \p V as a masked gather
/// or scatter operation.
bool isLegalGatherOrScatter(Value *V) {
bool LI = isa<LoadInst>(V);
bool SI = isa<StoreInst>(V);
if (!LI && !SI)
return false;
auto *Ty = getMemInstValueType(V);
Align Align = getLoadStoreAlignment(V);
return (LI && isLegalMaskedGather(Ty, Align)) ||
(SI && isLegalMaskedScatter(Ty, Align));
}

/// Returns true if \p I is an instruction that will be scalarized with		/// Returns true if \p I is an instruction that will be scalarized with
/// predication. Such instructions include conditional stores and		/// predication. Such instructions include conditional stores and
/// instructions that may divide by zero.		/// instructions that may divide by zero.
/// If a non-zero VF has been calculated, we check if I will be scalarized		/// If a non-zero VF has been calculated, we check if I will be scalarized
/// predication for that VF.		/// predication for that VF.
bool isScalarWithPredication(Instruction *I,		bool isScalarWithPredication(Instruction *I,
ElementCount VF = ElementCount::getFixed(1));		ElementCount VF = ElementCount::getFixed(1));

▲ Show 20 Lines • Show All 4,209 Lines • ▼ Show 20 Lines	for (Instruction &I : BB->instructionsWithoutDebug()) {
//		//
// FIXME: The check here attempts to predict whether a load or store will		// FIXME: The check here attempts to predict whether a load or store will
// be vectorized. We only know this for certain after a VF has		// be vectorized. We only know this for certain after a VF has
// been selected. Here, we assume that if an access can be		// been selected. Here, we assume that if an access can be
// vectorized, it will be. We should also look at extending this		// vectorized, it will be. We should also look at extending this
// optimization to non-pointer types.		// optimization to non-pointer types.
//		//
if (T->isPointerTy() && !isConsecutiveLoadOrStore(&I) &&		if (T->isPointerTy() && !isConsecutiveLoadOrStore(&I) &&
!isAccessInterleaved(&I) && !isLegalGatherOrScatter(&I))		!isAccessInterleaved(&I) && !isLegalGatherOrScatter(&I, &TTI))
continue;		continue;

MinWidth = std::min(MinWidth,		MinWidth = std::min(MinWidth,
(unsigned)DL.getTypeSizeInBits(T->getScalarType()));		(unsigned)DL.getTypeSizeInBits(T->getScalarType()));
MaxWidth = std::max(MaxWidth,		MaxWidth = std::max(MaxWidth,
(unsigned)DL.getTypeSizeInBits(T->getScalarType()));		(unsigned)DL.getTypeSizeInBits(T->getScalarType()));
}		}
}		}
▲ Show 20 Lines • Show All 929 Lines • ▼ Show 20 Lines	for (Instruction &I : *BB) {
continue;		continue;

NumAccesses = Group->getNumMembers();		NumAccesses = Group->getNumMembers();
if (interleavedAccessCanBeWidened(&I, VF))		if (interleavedAccessCanBeWidened(&I, VF))
InterleaveCost = getInterleaveGroupCost(&I, VF);		InterleaveCost = getInterleaveGroupCost(&I, VF);
}		}

unsigned GatherScatterCost =		unsigned GatherScatterCost =
isLegalGatherOrScatter(&I)		isLegalGatherOrScatter(&I, &TTI)
? getGatherScatterCost(&I, VF) * NumAccesses		? getGatherScatterCost(&I, VF) * NumAccesses
: std::numeric_limits<unsigned>::max();		: std::numeric_limits<unsigned>::max();

unsigned ScalarizationCost =		unsigned ScalarizationCost =
getMemInstScalarizationCost(&I, VF) * NumAccesses;		getMemInstScalarizationCost(&I, VF) * NumAccesses;

// Choose better solution for the current VF,		// Choose better solution for the current VF,
// write down this decision and use it during vectorization.		// write down this decision and use it during vectorization.
▲ Show 20 Lines • Show All 2,242 Lines • Show Last 20 Lines

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

Show First 20 Lines • Show All 2,868 Lines • ▼ Show 20 Lines	case Instruction::Load: {
ReuseShuffleIndicies, CurrentOrder);		ReuseShuffleIndicies, CurrentOrder);
TE->setOperandsInOrder();		TE->setOperandsInOrder();
LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled loads.\n");		LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled loads.\n");
findRootOrder(CurrentOrder);		findRootOrder(CurrentOrder);
++NumOpsWantToKeepOrder[CurrentOrder];		++NumOpsWantToKeepOrder[CurrentOrder];
}		}
return;		return;
}		}
		if (isLegalGatherOrScatter(VL0, TTI)) {
// Vectorizing non-consecutive loads with `llvm.masked.gather`.		// Vectorizing non-consecutive loads with `llvm.masked.gather`.
TreeEntry *TE = newTreeEntry(VL, TreeEntry::ScatterVectorize, Bundle, S,		TreeEntry *TE = newTreeEntry(VL, TreeEntry::ScatterVectorize, Bundle,
UserTreeIdx, ReuseShuffleIndicies);		S, UserTreeIdx, ReuseShuffleIndicies);
TE->setOperandsInOrder();		TE->setOperandsInOrder();
buildTree_rec(PointerOps, Depth + 1, {TE, 0});		buildTree_rec(PointerOps, Depth + 1, {TE, 0});
LLVM_DEBUG(dbgs() << "SLP: added a vector of non-consecutive loads.\n");		LLVM_DEBUG(dbgs() <<
		Lint: Pre-merge checks: clang-format: please reformat the code - LLVM_DEBUG(dbgs() << - "SLP: added a vector of non-consecutive loads.\n"); + LLVM_DEBUG(dbgs() + << "SLP: added a vector of non-consecutive loads.\n");
		"SLP: added a vector of non-consecutive loads.\n");
return;		return;
}		}
		}

LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");		LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);		ReuseShuffleIndicies);
return;		return;
}		}
case Instruction::ZExt:		case Instruction::ZExt:
▲ Show 20 Lines • Show All 4,996 Lines • Show Last 20 Lines

llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux -mcpu=corei7-avx | FileCheck %s			; RUN: opt -slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux -mcpu=corei7-avx -mattr=+avx512vl | FileCheck %s
	;			;
	; This file tests the look-ahead operand reordering heuristic.			; This file tests the look-ahead operand reordering heuristic.
	;			;
	;			;
	; This checks that operand reordering will reorder the operands of the adds			; This checks that operand reordering will reorder the operands of the adds
	; by taking into consideration the instructions beyond the immediate			; by taking into consideration the instructions beyond the immediate
	; predecessors.			; predecessors.
	;			;
	▲ Show 20 Lines • Show All 631 Lines • Show Last 20 Lines

llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+sse2 | FileCheck %s --check-prefixes=SSE			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+sse2 | FileCheck %s --check-prefixes=SSE
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx | FileCheck %s --check-prefixes=AVX			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx | FileCheck %s --check-prefixes=SSE
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx2 | FileCheck %s --check-prefixes=AVX			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx2 | FileCheck %s --check-prefixes=SSE
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512f | FileCheck %s --check-prefixes=AVX			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512f | FileCheck %s --check-prefixes=AVX
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512vl | FileCheck %s --check-prefixes=AVX			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512vl | FileCheck %s --check-prefixes=AVX


	@b = global [8 x i32] zeroinitializer, align 16			@b = global [8 x i32] zeroinitializer, align 16
	@a = global [8 x i32] zeroinitializer, align 16			@a = global [8 x i32] zeroinitializer, align 16

	define void @foo() {			define void @foo() {
	Show All 31 Lines

llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+sse2 | FileCheck %s --check-prefixes=CHECK,SSE		; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+sse2 | FileCheck %s --check-prefixes=CHECK,SSE
; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx | FileCheck %s --check-prefixes=CHECK,AVX		; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx | FileCheck %s --check-prefixes=CHECK,AVX
; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx2 | FileCheck %s --check-prefixes=CHECK,AVX2		; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx2 | FileCheck %s --check-prefixes=CHECK,AVX
; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512f | FileCheck %s --check-prefixes=CHECK,AVX512		; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512f | FileCheck %s --check-prefixes=CHECK,AVX512
; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512vl | FileCheck %s --check-prefixes=CHECK,AVX512		; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512vl | FileCheck %s --check-prefixes=CHECK,AVX512

define void @gather_load(i32* noalias nocapture %0, i32* noalias nocapture readonly %1) {		define void @gather_load(i32* noalias nocapture %0, i32* noalias nocapture readonly %1) {
; CHECK-LABEL: @gather_load(		; CHECK-LABEL: @gather_load(
; CHECK-NEXT: [[TMP3:%.]] = getelementptr inbounds i32, i32 [[TMP1:%.*]], i64 1		; CHECK-NEXT: [[TMP3:%.]] = getelementptr inbounds i32, i32 [[TMP1:%.*]], i64 1
; CHECK-NEXT: [[TMP4:%.]] = load i32, i32 [[TMP1]], align 4, [[TBAA0:!tbaa !.*]]		; CHECK-NEXT: [[TMP4:%.]] = load i32, i32 [[TMP1]], align 4, [[TBAA0:!tbaa !.*]]
; CHECK-NEXT: [[TMP5:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 11		; CHECK-NEXT: [[TMP5:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 11
▲ Show 20 Lines • Show All 51 Lines • ▼ Show 20 Lines
; SSE-NEXT: [[TMP16:%.]] = load i32, i32 [[TMP15]], align 4, [[TBAA0]]		; SSE-NEXT: [[TMP16:%.]] = load i32, i32 [[TMP15]], align 4, [[TBAA0]]
; SSE-NEXT: [[TMP17:%.*]] = add nsw i32 [[TMP16]], 4		; SSE-NEXT: [[TMP17:%.*]] = add nsw i32 [[TMP16]], 4
; SSE-NEXT: store i32 [[TMP17]], i32* [[TMP14]], align 4, [[TBAA0]]		; SSE-NEXT: store i32 [[TMP17]], i32* [[TMP14]], align 4, [[TBAA0]]
; SSE-NEXT: ret void		; SSE-NEXT: ret void
;		;
; AVX-LABEL: @gather_load_2(		; AVX-LABEL: @gather_load_2(
; AVX-NEXT: [[TMP3:%.]] = getelementptr inbounds i32, i32 [[TMP1:%.*]], i64 1		; AVX-NEXT: [[TMP3:%.]] = getelementptr inbounds i32, i32 [[TMP1:%.*]], i64 1
; AVX-NEXT: [[TMP4:%.]] = load i32, i32 [[TMP3]], align 4, [[TBAA0:!tbaa !.*]]		; AVX-NEXT: [[TMP4:%.]] = load i32, i32 [[TMP3]], align 4, [[TBAA0:!tbaa !.*]]
; AVX-NEXT: [[TMP5:%.*]] = add nsw i32 [[TMP4]], 1		; AVX-NEXT: [[TMP5:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 10
; AVX-NEXT: [[TMP6:%.]] = getelementptr inbounds i32, i32 [[TMP0:%.*]], i64 1		; AVX-NEXT: [[TMP6:%.]] = load i32, i32 [[TMP5]], align 4, [[TBAA0]]
; AVX-NEXT: store i32 [[TMP5]], i32* [[TMP0]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP7:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 3
; AVX-NEXT: [[TMP7:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 10
; AVX-NEXT: [[TMP8:%.]] = load i32, i32 [[TMP7]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP8:%.]] = load i32, i32 [[TMP7]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP9:%.*]] = add nsw i32 [[TMP8]], 2		; AVX-NEXT: [[TMP9:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 5
; AVX-NEXT: [[TMP10:%.]] = getelementptr inbounds i32, i32 [[TMP0]], i64 2		; AVX-NEXT: [[TMP10:%.]] = load i32, i32 [[TMP9]], align 4, [[TBAA0]]
; AVX-NEXT: store i32 [[TMP9]], i32* [[TMP6]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> undef, i32 [[TMP4]], i32 0
; AVX-NEXT: [[TMP11:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 3		; AVX-NEXT: [[TMP12:%.*]] = insertelement <4 x i32> [[TMP11]], i32 [[TMP6]], i32 1
; AVX-NEXT: [[TMP12:%.]] = load i32, i32 [[TMP11]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP13:%.*]] = insertelement <4 x i32> [[TMP12]], i32 [[TMP8]], i32 2
; AVX-NEXT: [[TMP13:%.*]] = add nsw i32 [[TMP12]], 3		; AVX-NEXT: [[TMP14:%.*]] = insertelement <4 x i32> [[TMP13]], i32 [[TMP10]], i32 3
; AVX-NEXT: [[TMP14:%.]] = getelementptr inbounds i32, i32 [[TMP0]], i64 3		; AVX-NEXT: [[TMP15:%.*]] = add nsw <4 x i32> [[TMP14]], <i32 1, i32 2, i32 3, i32 4>
; AVX-NEXT: store i32 [[TMP13]], i32* [[TMP10]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP16:%.]] = bitcast i32 [[TMP0:%.]] to <4 x i32>
; AVX-NEXT: [[TMP15:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 5		; AVX-NEXT: store <4 x i32> [[TMP15]], <4 x i32>* [[TMP16]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP16:%.]] = load i32, i32 [[TMP15]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP17:%.*]] = add nsw i32 [[TMP16]], 4
; AVX-NEXT: store i32 [[TMP17]], i32* [[TMP14]], align 4, [[TBAA0]]
; AVX-NEXT: ret void		; AVX-NEXT: ret void
;		;
; AVX2-LABEL: @gather_load_2(
; AVX2-NEXT: [[TMP3:%.]] = insertelement <4 x i32> undef, i32* [[TMP1:%.*]], i32 0
; AVX2-NEXT: [[TMP4:%.]] = shufflevector <4 x i32> [[TMP3]], <4 x i32*> undef, <4 x i32> zeroinitializer
; AVX2-NEXT: [[TMP5:%.]] = getelementptr i32, <4 x i32> [[TMP4]], <4 x i64> <i64 1, i64 10, i64 3, i64 5>
; AVX2-NEXT: [[TMP6:%.]] = call <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32> [[TMP5]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef), [[TBAA0:!tbaa !.*]]
; AVX2-NEXT: [[TMP7:%.*]] = add nsw <4 x i32> [[TMP6]], <i32 1, i32 2, i32 3, i32 4>
; AVX2-NEXT: [[TMP8:%.]] = bitcast i32 [[TMP0:%.]] to <4 x i32>
; AVX2-NEXT: store <4 x i32> [[TMP7]], <4 x i32>* [[TMP8]], align 4, [[TBAA0]]
; AVX2-NEXT: ret void
;
; AVX512-LABEL: @gather_load_2(		; AVX512-LABEL: @gather_load_2(
; AVX512-NEXT: [[TMP3:%.]] = insertelement <4 x i32> undef, i32* [[TMP1:%.*]], i32 0		; AVX512-NEXT: [[TMP3:%.]] = insertelement <4 x i32> undef, i32* [[TMP1:%.*]], i32 0
; AVX512-NEXT: [[TMP4:%.]] = shufflevector <4 x i32> [[TMP3]], <4 x i32*> undef, <4 x i32> zeroinitializer		; AVX512-NEXT: [[TMP4:%.]] = shufflevector <4 x i32> [[TMP3]], <4 x i32*> undef, <4 x i32> zeroinitializer
; AVX512-NEXT: [[TMP5:%.]] = getelementptr i32, <4 x i32> [[TMP4]], <4 x i64> <i64 1, i64 10, i64 3, i64 5>		; AVX512-NEXT: [[TMP5:%.]] = getelementptr i32, <4 x i32> [[TMP4]], <4 x i64> <i64 1, i64 10, i64 3, i64 5>
; AVX512-NEXT: [[TMP6:%.]] = call <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32> [[TMP5]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef), [[TBAA0:!tbaa !.*]]		; AVX512-NEXT: [[TMP6:%.]] = call <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32> [[TMP5]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef), [[TBAA0:!tbaa !.*]]
; AVX512-NEXT: [[TMP7:%.*]] = add nsw <4 x i32> [[TMP6]], <i32 1, i32 2, i32 3, i32 4>		; AVX512-NEXT: [[TMP7:%.*]] = add nsw <4 x i32> [[TMP6]], <i32 1, i32 2, i32 3, i32 4>
; AVX512-NEXT: [[TMP8:%.]] = bitcast i32 [[TMP0:%.]] to <4 x i32>		; AVX512-NEXT: [[TMP8:%.]] = bitcast i32 [[TMP0:%.]] to <4 x i32>
; AVX512-NEXT: store <4 x i32> [[TMP7]], <4 x i32>* [[TMP8]], align 4, [[TBAA0]]		; AVX512-NEXT: store <4 x i32> [[TMP7]], <4 x i32>* [[TMP8]], align 4, [[TBAA0]]
▲ Show 20 Lines • Show All 61 Lines • ▼ Show 20 Lines
; SSE-NEXT: [[TMP30:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 21		; SSE-NEXT: [[TMP30:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 21
; SSE-NEXT: [[TMP31:%.]] = load i32, i32 [[TMP30]], align 4, [[TBAA0]]		; SSE-NEXT: [[TMP31:%.]] = load i32, i32 [[TMP30]], align 4, [[TBAA0]]
; SSE-NEXT: [[TMP32:%.*]] = add i32 [[TMP31]], 4		; SSE-NEXT: [[TMP32:%.*]] = add i32 [[TMP31]], 4
; SSE-NEXT: store i32 [[TMP32]], i32* [[TMP29]], align 4, [[TBAA0]]		; SSE-NEXT: store i32 [[TMP32]], i32* [[TMP29]], align 4, [[TBAA0]]
; SSE-NEXT: ret void		; SSE-NEXT: ret void
;		;
; AVX-LABEL: @gather_load_3(		; AVX-LABEL: @gather_load_3(
; AVX-NEXT: [[TMP3:%.]] = load i32, i32 [[TMP1:%.*]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP3:%.]] = load i32, i32 [[TMP1:%.*]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP4:%.*]] = add i32 [[TMP3]], 1		; AVX-NEXT: [[TMP4:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 11
; AVX-NEXT: [[TMP5:%.]] = getelementptr inbounds i32, i32 [[TMP0:%.*]], i64 1		; AVX-NEXT: [[TMP5:%.]] = load i32, i32 [[TMP4]], align 4, [[TBAA0]]
; AVX-NEXT: store i32 [[TMP4]], i32* [[TMP0]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP6:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 4
; AVX-NEXT: [[TMP6:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 11
; AVX-NEXT: [[TMP7:%.]] = load i32, i32 [[TMP6]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP7:%.]] = load i32, i32 [[TMP6]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP8:%.*]] = add i32 [[TMP7]], 2		; AVX-NEXT: [[TMP8:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 15
; AVX-NEXT: [[TMP9:%.]] = getelementptr inbounds i32, i32 [[TMP0]], i64 2		; AVX-NEXT: [[TMP9:%.]] = load i32, i32 [[TMP8]], align 4, [[TBAA0]]
; AVX-NEXT: store i32 [[TMP8]], i32* [[TMP5]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP10:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 18
; AVX-NEXT: [[TMP10:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 4
; AVX-NEXT: [[TMP11:%.]] = load i32, i32 [[TMP10]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP11:%.]] = load i32, i32 [[TMP10]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP12:%.*]] = add i32 [[TMP11]], 3		; AVX-NEXT: [[TMP12:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 9
; AVX-NEXT: [[TMP13:%.]] = getelementptr inbounds i32, i32 [[TMP0]], i64 3		; AVX-NEXT: [[TMP13:%.]] = load i32, i32 [[TMP12]], align 4, [[TBAA0]]
; AVX-NEXT: store i32 [[TMP12]], i32* [[TMP9]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP14:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 6
; AVX-NEXT: [[TMP14:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 15
; AVX-NEXT: [[TMP15:%.]] = load i32, i32 [[TMP14]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP15:%.]] = load i32, i32 [[TMP14]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP16:%.*]] = add i32 [[TMP15]], 4		; AVX-NEXT: [[TMP16:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 21
; AVX-NEXT: [[TMP17:%.]] = getelementptr inbounds i32, i32 [[TMP0]], i64 4		; AVX-NEXT: [[TMP17:%.]] = load i32, i32 [[TMP16]], align 4, [[TBAA0]]
; AVX-NEXT: store i32 [[TMP16]], i32* [[TMP13]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP18:%.*]] = insertelement <8 x i32> undef, i32 [[TMP3]], i32 0
; AVX-NEXT: [[TMP18:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 18		; AVX-NEXT: [[TMP19:%.*]] = insertelement <8 x i32> [[TMP18]], i32 [[TMP5]], i32 1
; AVX-NEXT: [[TMP19:%.]] = load i32, i32 [[TMP18]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP20:%.*]] = insertelement <8 x i32> [[TMP19]], i32 [[TMP7]], i32 2
; AVX-NEXT: [[TMP20:%.*]] = add i32 [[TMP19]], 1		; AVX-NEXT: [[TMP21:%.*]] = insertelement <8 x i32> [[TMP20]], i32 [[TMP9]], i32 3
; AVX-NEXT: [[TMP21:%.]] = getelementptr inbounds i32, i32 [[TMP0]], i64 5		; AVX-NEXT: [[TMP22:%.*]] = insertelement <8 x i32> [[TMP21]], i32 [[TMP11]], i32 4
; AVX-NEXT: store i32 [[TMP20]], i32* [[TMP17]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP23:%.*]] = insertelement <8 x i32> [[TMP22]], i32 [[TMP13]], i32 5
; AVX-NEXT: [[TMP22:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 9		; AVX-NEXT: [[TMP24:%.*]] = insertelement <8 x i32> [[TMP23]], i32 [[TMP15]], i32 6
; AVX-NEXT: [[TMP23:%.]] = load i32, i32 [[TMP22]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP25:%.*]] = insertelement <8 x i32> [[TMP24]], i32 [[TMP17]], i32 7
; AVX-NEXT: [[TMP24:%.*]] = add i32 [[TMP23]], 2		; AVX-NEXT: [[TMP26:%.*]] = add <8 x i32> [[TMP25]], <i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4>
; AVX-NEXT: [[TMP25:%.]] = getelementptr inbounds i32, i32 [[TMP0]], i64 6		; AVX-NEXT: [[TMP27:%.]] = bitcast i32 [[TMP0:%.]] to <8 x i32>
; AVX-NEXT: store i32 [[TMP24]], i32* [[TMP21]], align 4, [[TBAA0]]		; AVX-NEXT: store <8 x i32> [[TMP26]], <8 x i32>* [[TMP27]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP26:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 6
; AVX-NEXT: [[TMP27:%.]] = load i32, i32 [[TMP26]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP28:%.*]] = add i32 [[TMP27]], 3
; AVX-NEXT: [[TMP29:%.]] = getelementptr inbounds i32, i32 [[TMP0]], i64 7
; AVX-NEXT: store i32 [[TMP28]], i32* [[TMP25]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP30:%.]] = getelementptr inbounds i32, i32 [[TMP1]], i64 21
; AVX-NEXT: [[TMP31:%.]] = load i32, i32 [[TMP30]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP32:%.*]] = add i32 [[TMP31]], 4
; AVX-NEXT: store i32 [[TMP32]], i32* [[TMP29]], align 4, [[TBAA0]]
; AVX-NEXT: ret void		; AVX-NEXT: ret void
;		;
; AVX2-LABEL: @gather_load_3(
; AVX2-NEXT: [[TMP3:%.*]] = load i32, i32* [[TMP1:%.*]], align 4, [[TBAA0]]
; AVX2-NEXT: [[TMP4:%.*]] = add i32 [[TMP3]], 1
; AVX2-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[TMP0:%.*]], i64 1
; AVX2-NEXT: store i32 [[TMP4]], i32* [[TMP0]], align 4, [[TBAA0]]
; AVX2-NEXT: [[TMP6:%.*]] = insertelement <4 x i32*> undef, i32* [[TMP1]], i32 0
; AVX2-NEXT: [[TMP7:%.*]] = shufflevector <4 x i32*> [[TMP6]], <4 x i32*> undef, <4 x i32> zeroinitializer
; AVX2-NEXT: [[TMP8:%.*]] = getelementptr i32, <4 x i32*> [[TMP7]], <4 x i64> <i64 11, i64 4, i64 15, i64 18>
; AVX2-NEXT: [[TMP9:%.*]] = call <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32*> [[TMP8]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef), [[TBAA0]]
; AVX2-NEXT: [[TMP10:%.*]] = add <4 x i32> [[TMP9]], <i32 2, i32 3, i32 4, i32 1>
; AVX2-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 5
; AVX2-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP5]] to <4 x i32>*
; AVX2-NEXT: store <4 x i32> [[TMP10]], <4 x i32>* [[TMP12]], align 4, [[TBAA0]]
; AVX2-NEXT: [[TMP13:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 9
; AVX2-NEXT: [[TMP14:%.*]] = load i32, i32* [[TMP13]], align 4, [[TBAA0]]
; AVX2-NEXT: [[TMP15:%.*]] = add i32 [[TMP14]], 2
; AVX2-NEXT: [[TMP16:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 6
; AVX2-NEXT: store i32 [[TMP15]], i32* [[TMP11]], align 4, [[TBAA0]]
; AVX2-NEXT: [[TMP17:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 6
; AVX2-NEXT: [[TMP18:%.*]] = load i32, i32* [[TMP17]], align 4, [[TBAA0]]
; AVX2-NEXT: [[TMP19:%.*]] = add i32 [[TMP18]], 3
; AVX2-NEXT: [[TMP20:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 7
; AVX2-NEXT: store i32 [[TMP19]], i32* [[TMP16]], align 4, [[TBAA0]]
; AVX2-NEXT: [[TMP21:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 21
; AVX2-NEXT: [[TMP22:%.*]] = load i32, i32* [[TMP21]], align 4, [[TBAA0]]
; AVX2-NEXT: [[TMP23:%.*]] = add i32 [[TMP22]], 4
; AVX2-NEXT: store i32 [[TMP23]], i32* [[TMP20]], align 4, [[TBAA0]]
; AVX2-NEXT: ret void
;
; AVX512-LABEL: @gather_load_3(		; AVX512-LABEL: @gather_load_3(
; AVX512-NEXT: [[TMP3:%.*]] = load i32, i32* [[TMP1:%.*]], align 4, [[TBAA0]]		; AVX512-NEXT: [[TMP3:%.*]] = load i32, i32* [[TMP1:%.*]], align 4, [[TBAA0]]
; AVX512-NEXT: [[TMP4:%.*]] = add i32 [[TMP3]], 1		; AVX512-NEXT: [[TMP4:%.*]] = add i32 [[TMP3]], 1
; AVX512-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[TMP0:%.*]], i64 1		; AVX512-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[TMP0:%.*]], i64 1
; AVX512-NEXT: store i32 [[TMP4]], i32* [[TMP0]], align 4, [[TBAA0]]		; AVX512-NEXT: store i32 [[TMP4]], i32* [[TMP0]], align 4, [[TBAA0]]
; AVX512-NEXT: [[TMP6:%.*]] = insertelement <4 x i32*> undef, i32* [[TMP1]], i32 0		; AVX512-NEXT: [[TMP6:%.*]] = insertelement <4 x i32*> undef, i32* [[TMP1]], i32 0
; AVX512-NEXT: [[TMP7:%.*]] = shufflevector <4 x i32*> [[TMP6]], <4 x i32*> undef, <4 x i32> zeroinitializer		; AVX512-NEXT: [[TMP7:%.*]] = shufflevector <4 x i32*> [[TMP6]], <4 x i32*> undef, <4 x i32> zeroinitializer
; AVX512-NEXT: [[TMP8:%.*]] = getelementptr i32, <4 x i32*> [[TMP7]], <4 x i64> <i64 11, i64 4, i64 15, i64 18>		; AVX512-NEXT: [[TMP8:%.*]] = getelementptr i32, <4 x i32*> [[TMP7]], <4 x i64> <i64 11, i64 4, i64 15, i64 18>
; SSE-NEXT: store i32 [[T16]], i32* [[T13]], align 4, [[TBAA0]]		; SSE-NEXT: store i32 [[T16]], i32* [[T13]], align 4, [[TBAA0]]
; SSE-NEXT: store i32 [[T20]], i32* [[T17]], align 4, [[TBAA0]]		; SSE-NEXT: store i32 [[T20]], i32* [[T17]], align 4, [[TBAA0]]
; SSE-NEXT: store i32 [[T24]], i32* [[T21]], align 4, [[TBAA0]]		; SSE-NEXT: store i32 [[T24]], i32* [[T21]], align 4, [[TBAA0]]
; SSE-NEXT: store i32 [[T28]], i32* [[T25]], align 4, [[TBAA0]]		; SSE-NEXT: store i32 [[T28]], i32* [[T25]], align 4, [[TBAA0]]
; SSE-NEXT: store i32 [[T32]], i32* [[T29]], align 4, [[TBAA0]]		; SSE-NEXT: store i32 [[T32]], i32* [[T29]], align 4, [[TBAA0]]
; SSE-NEXT: ret void		; SSE-NEXT: ret void
;		;
; AVX-LABEL: @gather_load_4(		; AVX-LABEL: @gather_load_4(
; AVX-NEXT: [[T5:%.*]] = getelementptr inbounds i32, i32* [[T0:%.*]], i64 1
; AVX-NEXT: [[T6:%.*]] = getelementptr inbounds i32, i32* [[T1:%.*]], i64 11		; AVX-NEXT: [[T6:%.*]] = getelementptr inbounds i32, i32* [[T1:%.*]], i64 11
; AVX-NEXT: [[T9:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 2
; AVX-NEXT: [[T10:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 4		; AVX-NEXT: [[T10:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 4
; AVX-NEXT: [[T13:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 3
; AVX-NEXT: [[T14:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 15		; AVX-NEXT: [[T14:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 15
; AVX-NEXT: [[T17:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 4
; AVX-NEXT: [[T18:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 18		; AVX-NEXT: [[T18:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 18
; AVX-NEXT: [[T21:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 5
; AVX-NEXT: [[T22:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 9		; AVX-NEXT: [[T22:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 9
; AVX-NEXT: [[T25:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 6
; AVX-NEXT: [[T26:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 6		; AVX-NEXT: [[T26:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 6
; AVX-NEXT: [[T29:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 7
; AVX-NEXT: [[T30:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 21		; AVX-NEXT: [[T30:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 21
; AVX-NEXT: [[T3:%.*]] = load i32, i32* [[T1]], align 4, [[TBAA0]]		; AVX-NEXT: [[T3:%.*]] = load i32, i32* [[T1]], align 4, [[TBAA0]]
; AVX-NEXT: [[T7:%.*]] = load i32, i32* [[T6]], align 4, [[TBAA0]]		; AVX-NEXT: [[T7:%.*]] = load i32, i32* [[T6]], align 4, [[TBAA0]]
; AVX-NEXT: [[T11:%.*]] = load i32, i32* [[T10]], align 4, [[TBAA0]]		; AVX-NEXT: [[T11:%.*]] = load i32, i32* [[T10]], align 4, [[TBAA0]]
; AVX-NEXT: [[T15:%.*]] = load i32, i32* [[T14]], align 4, [[TBAA0]]		; AVX-NEXT: [[T15:%.*]] = load i32, i32* [[T14]], align 4, [[TBAA0]]
; AVX-NEXT: [[T19:%.*]] = load i32, i32* [[T18]], align 4, [[TBAA0]]		; AVX-NEXT: [[T19:%.*]] = load i32, i32* [[T18]], align 4, [[TBAA0]]
; AVX-NEXT: [[T23:%.*]] = load i32, i32* [[T22]], align 4, [[TBAA0]]		; AVX-NEXT: [[T23:%.*]] = load i32, i32* [[T22]], align 4, [[TBAA0]]
; AVX-NEXT: [[T27:%.*]] = load i32, i32* [[T26]], align 4, [[TBAA0]]		; AVX-NEXT: [[T27:%.*]] = load i32, i32* [[T26]], align 4, [[TBAA0]]
; AVX-NEXT: [[T31:%.*]] = load i32, i32* [[T30]], align 4, [[TBAA0]]		; AVX-NEXT: [[T31:%.*]] = load i32, i32* [[T30]], align 4, [[TBAA0]]
; AVX-NEXT: [[T4:%.*]] = add i32 [[T3]], 1		; AVX-NEXT: [[TMP1:%.*]] = insertelement <8 x i32> undef, i32 [[T3]], i32 0
; AVX-NEXT: [[T8:%.*]] = add i32 [[T7]], 2		; AVX-NEXT: [[TMP2:%.*]] = insertelement <8 x i32> [[TMP1]], i32 [[T7]], i32 1
; AVX-NEXT: [[T12:%.*]] = add i32 [[T11]], 3		; AVX-NEXT: [[TMP3:%.*]] = insertelement <8 x i32> [[TMP2]], i32 [[T11]], i32 2
; AVX-NEXT: [[T16:%.*]] = add i32 [[T15]], 4		; AVX-NEXT: [[TMP4:%.*]] = insertelement <8 x i32> [[TMP3]], i32 [[T15]], i32 3
; AVX-NEXT: [[T20:%.*]] = add i32 [[T19]], 1		; AVX-NEXT: [[TMP5:%.*]] = insertelement <8 x i32> [[TMP4]], i32 [[T19]], i32 4
; AVX-NEXT: [[T24:%.*]] = add i32 [[T23]], 2		; AVX-NEXT: [[TMP6:%.*]] = insertelement <8 x i32> [[TMP5]], i32 [[T23]], i32 5
; AVX-NEXT: [[T28:%.*]] = add i32 [[T27]], 3		; AVX-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[T27]], i32 6
; AVX-NEXT: [[T32:%.*]] = add i32 [[T31]], 4		; AVX-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[T31]], i32 7
; AVX-NEXT: store i32 [[T4]], i32* [[T0]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP9:%.*]] = add <8 x i32> [[TMP8]], <i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4>
; AVX-NEXT: store i32 [[T8]], i32* [[T5]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP10:%.*]] = bitcast i32* [[T0:%.*]] to <8 x i32>*
; AVX-NEXT: store i32 [[T12]], i32* [[T9]], align 4, [[TBAA0]]		; AVX-NEXT: store <8 x i32> [[TMP9]], <8 x i32>* [[TMP10]], align 4, [[TBAA0]]
; AVX-NEXT: store i32 [[T16]], i32* [[T13]], align 4, [[TBAA0]]
; AVX-NEXT: store i32 [[T20]], i32* [[T17]], align 4, [[TBAA0]]
; AVX-NEXT: store i32 [[T24]], i32* [[T21]], align 4, [[TBAA0]]
; AVX-NEXT: store i32 [[T28]], i32* [[T25]], align 4, [[TBAA0]]
; AVX-NEXT: store i32 [[T32]], i32* [[T29]], align 4, [[TBAA0]]
; AVX-NEXT: ret void		; AVX-NEXT: ret void
;		;
; AVX2-LABEL: @gather_load_4(
; AVX2-NEXT: [[T5:%.*]] = getelementptr inbounds i32, i32* [[T0:%.*]], i64 1
; AVX2-NEXT: [[TMP1:%.*]] = insertelement <4 x i32*> undef, i32* [[T1:%.*]], i32 0
; AVX2-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32*> [[TMP1]], <4 x i32*> undef, <4 x i32> zeroinitializer
; AVX2-NEXT: [[TMP3:%.*]] = getelementptr i32, <4 x i32*> [[TMP2]], <4 x i64> <i64 11, i64 4, i64 15, i64 18>
; AVX2-NEXT: [[T21:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 5
; AVX2-NEXT: [[T22:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 9
; AVX2-NEXT: [[T25:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 6
; AVX2-NEXT: [[T26:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 6
; AVX2-NEXT: [[T29:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 7
; AVX2-NEXT: [[T30:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 21
; AVX2-NEXT: [[T3:%.*]] = load i32, i32* [[T1]], align 4, [[TBAA0]]
; AVX2-NEXT: [[TMP4:%.*]] = call <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32*> [[TMP3]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef), [[TBAA0]]
; AVX2-NEXT: [[T23:%.*]] = load i32, i32* [[T22]], align 4, [[TBAA0]]
; AVX2-NEXT: [[T27:%.*]] = load i32, i32* [[T26]], align 4, [[TBAA0]]
; AVX2-NEXT: [[T31:%.*]] = load i32, i32* [[T30]], align 4, [[TBAA0]]
; AVX2-NEXT: [[T4:%.*]] = add i32 [[T3]], 1
; AVX2-NEXT: [[TMP5:%.*]] = add <4 x i32> [[TMP4]], <i32 2, i32 3, i32 4, i32 1>
; AVX2-NEXT: [[T24:%.*]] = add i32 [[T23]], 2
; AVX2-NEXT: [[T28:%.*]] = add i32 [[T27]], 3
; AVX2-NEXT: [[T32:%.*]] = add i32 [[T31]], 4
; AVX2-NEXT: store i32 [[T4]], i32* [[T0]], align 4, [[TBAA0]]
; AVX2-NEXT: [[TMP6:%.*]] = bitcast i32* [[T5]] to <4 x i32>*
; AVX2-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4, [[TBAA0]]
; AVX2-NEXT: store i32 [[T24]], i32* [[T21]], align 4, [[TBAA0]]
; AVX2-NEXT: store i32 [[T28]], i32* [[T25]], align 4, [[TBAA0]]
; AVX2-NEXT: store i32 [[T32]], i32* [[T29]], align 4, [[TBAA0]]
; AVX2-NEXT: ret void
;
; AVX512-LABEL: @gather_load_4(		; AVX512-LABEL: @gather_load_4(
; AVX512-NEXT: [[T5:%.*]] = getelementptr inbounds i32, i32* [[T0:%.*]], i64 1		; AVX512-NEXT: [[T5:%.*]] = getelementptr inbounds i32, i32* [[T0:%.*]], i64 1
; AVX512-NEXT: [[TMP1:%.*]] = insertelement <4 x i32*> undef, i32* [[T1:%.*]], i32 0		; AVX512-NEXT: [[TMP1:%.*]] = insertelement <4 x i32*> undef, i32* [[T1:%.*]], i32 0
; AVX512-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32*> [[TMP1]], <4 x i32*> undef, <4 x i32> zeroinitializer		; AVX512-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32*> [[TMP1]], <4 x i32*> undef, <4 x i32> zeroinitializer
; AVX512-NEXT: [[TMP3:%.*]] = getelementptr i32, <4 x i32*> [[TMP2]], <4 x i64> <i64 11, i64 4, i64 15, i64 18>		; AVX512-NEXT: [[TMP3:%.*]] = getelementptr i32, <4 x i32*> [[TMP2]], <4 x i64> <i64 11, i64 4, i64 15, i64 18>
; AVX512-NEXT: [[T21:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 5		; AVX512-NEXT: [[T21:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 5
; AVX512-NEXT: [[T22:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 9		; AVX512-NEXT: [[T22:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 9
; AVX512-NEXT: [[T25:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 6		; AVX512-NEXT: [[T25:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 6
store i32 %t32, i32* %t29, align 4, !tbaa !2		store i32 %t32, i32* %t29, align 4, !tbaa !2

ret void		ret void
}		}


define void @gather_load_div(float* noalias nocapture %0, float* noalias nocapture readonly %1) {		define void @gather_load_div(float* noalias nocapture %0, float* noalias nocapture readonly %1) {
; SSE-LABEL: @gather_load_div(		; SSE-LABEL: @gather_load_div(
; SSE-NEXT: [[TMP3:%.*]] = getelementptr inbounds float, float* [[TMP1:%.*]], i64 10		; SSE-NEXT: [[TMP3:%.*]] = load float, float* [[TMP1:%.*]], align 4, !tbaa !0
; SSE-NEXT: [[TMP4:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 3		; SSE-NEXT: [[TMP4:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 4
; SSE-NEXT: [[TMP5:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 14		; SSE-NEXT: [[TMP5:%.*]] = load float, float* [[TMP4]], align 4, !tbaa !0
; SSE-NEXT: [[TMP6:%.*]] = insertelement <4 x float*> undef, float* [[TMP1]], i32 0		; SSE-NEXT: [[TMP6:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 10
; SSE-NEXT: [[TMP7:%.*]] = insertelement <4 x float*> [[TMP6]], float* [[TMP3]], i32 1		; SSE-NEXT: [[TMP7:%.*]] = load float, float* [[TMP6]], align 4, !tbaa !0
; SSE-NEXT: [[TMP8:%.*]] = insertelement <4 x float*> [[TMP7]], float* [[TMP4]], i32 2		; SSE-NEXT: [[TMP8:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 13
; SSE-NEXT: [[TMP9:%.*]] = insertelement <4 x float*> [[TMP8]], float* [[TMP5]], i32 3		; SSE-NEXT: [[TMP9:%.*]] = load float, float* [[TMP8]], align 4, !tbaa !0
; SSE-NEXT: [[TMP10:%.*]] = call <4 x float> @llvm.masked.gather.v4f32.v4p0f32(<4 x float*> [[TMP9]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x float> undef), [[TBAA0]]		; SSE-NEXT: [[TMP10:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 3
; SSE-NEXT: [[TMP11:%.*]] = shufflevector <4 x float*> [[TMP6]], <4 x float*> undef, <4 x i32> zeroinitializer		; SSE-NEXT: [[TMP11:%.*]] = load float, float* [[TMP10]], align 4, !tbaa !0
; SSE-NEXT: [[TMP12:%.*]] = getelementptr float, <4 x float*> [[TMP11]], <4 x i64> <i64 4, i64 13, i64 11, i64 44>		; SSE-NEXT: [[TMP12:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 11
; SSE-NEXT: [[TMP13:%.*]] = call <4 x float> @llvm.masked.gather.v4f32.v4p0f32(<4 x float*> [[TMP12]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x float> undef), [[TBAA0]]		; SSE-NEXT: [[TMP13:%.*]] = load float, float* [[TMP12]], align 4, !tbaa !0
; SSE-NEXT: [[TMP14:%.*]] = fdiv <4 x float> [[TMP10]], [[TMP13]]		; SSE-NEXT: [[TMP14:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 14
; SSE-NEXT: [[TMP15:%.*]] = getelementptr inbounds float, float* [[TMP0:%.*]], i64 4		; SSE-NEXT: [[TMP15:%.*]] = load float, float* [[TMP14]], align 4, !tbaa !0
; SSE-NEXT: [[TMP16:%.*]] = bitcast float* [[TMP0]] to <4 x float>*		; SSE-NEXT: [[TMP16:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 44
; SSE-NEXT: store <4 x float> [[TMP14]], <4 x float>* [[TMP16]], align 4, [[TBAA0]]		; SSE-NEXT: [[TMP17:%.*]] = load float, float* [[TMP16]], align 4, !tbaa !0
; SSE-NEXT: [[TMP17:%.*]] = getelementptr float, <4 x float*> [[TMP11]], <4 x i64> <i64 17, i64 8, i64 5, i64 20>		; SSE-NEXT: [[TMP18:%.*]] = insertelement <4 x float> undef, float [[TMP3]], i32 0
; SSE-NEXT: [[TMP18:%.*]] = call <4 x float> @llvm.masked.gather.v4f32.v4p0f32(<4 x float*> [[TMP17]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x float> undef), [[TBAA0]]		; SSE-NEXT: [[TMP19:%.*]] = insertelement <4 x float> [[TMP18]], float [[TMP7]], i32 1
; SSE-NEXT: [[TMP19:%.*]] = getelementptr float, <4 x float*> [[TMP11]], <4 x i64> <i64 33, i64 30, i64 27, i64 23>		; SSE-NEXT: [[TMP20:%.*]] = insertelement <4 x float> [[TMP19]], float [[TMP11]], i32 2
; SSE-NEXT: [[TMP20:%.*]] = call <4 x float> @llvm.masked.gather.v4f32.v4p0f32(<4 x float*> [[TMP19]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x float> undef), [[TBAA0]]		; SSE-NEXT: [[TMP21:%.*]] = insertelement <4 x float> [[TMP20]], float [[TMP15]], i32 3
; SSE-NEXT: [[TMP21:%.*]] = fdiv <4 x float> [[TMP18]], [[TMP20]]		; SSE-NEXT: [[TMP22:%.*]] = insertelement <4 x float> undef, float [[TMP5]], i32 0
; SSE-NEXT: [[TMP22:%.*]] = bitcast float* [[TMP15]] to <4 x float>*		; SSE-NEXT: [[TMP23:%.*]] = insertelement <4 x float> [[TMP22]], float [[TMP9]], i32 1
; SSE-NEXT: store <4 x float> [[TMP21]], <4 x float>* [[TMP22]], align 4, [[TBAA0]]		; SSE-NEXT: [[TMP24:%.*]] = insertelement <4 x float> [[TMP23]], float [[TMP13]], i32 2
		; SSE-NEXT: [[TMP25:%.*]] = insertelement <4 x float> [[TMP24]], float [[TMP17]], i32 3
		; SSE-NEXT: [[TMP26:%.*]] = fdiv <4 x float> [[TMP21]], [[TMP25]]
		; SSE-NEXT: [[TMP27:%.*]] = getelementptr inbounds float, float* [[TMP0:%.*]], i64 4
		; SSE-NEXT: [[TMP28:%.*]] = bitcast float* [[TMP0]] to <4 x float>*
		; SSE-NEXT: store <4 x float> [[TMP26]], <4 x float>* [[TMP28]], align 4, !tbaa !0
		; SSE-NEXT: [[TMP29:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 17
		; SSE-NEXT: [[TMP30:%.*]] = load float, float* [[TMP29]], align 4, !tbaa !0
		; SSE-NEXT: [[TMP31:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 33
		; SSE-NEXT: [[TMP32:%.*]] = load float, float* [[TMP31]], align 4, !tbaa !0
		; SSE-NEXT: [[TMP33:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 8
		; SSE-NEXT: [[TMP34:%.*]] = load float, float* [[TMP33]], align 4, !tbaa !0
		; SSE-NEXT: [[TMP35:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 30
		; SSE-NEXT: [[TMP36:%.*]] = load float, float* [[TMP35]], align 4, !tbaa !0
		; SSE-NEXT: [[TMP37:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 5
		; SSE-NEXT: [[TMP38:%.*]] = load float, float* [[TMP37]], align 4, !tbaa !0
		; SSE-NEXT: [[TMP39:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 27
		; SSE-NEXT: [[TMP40:%.*]] = load float, float* [[TMP39]], align 4, !tbaa !0
		; SSE-NEXT: [[TMP41:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 20
		; SSE-NEXT: [[TMP42:%.*]] = load float, float* [[TMP41]], align 4, !tbaa !0
		; SSE-NEXT: [[TMP43:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 23
		; SSE-NEXT: [[TMP44:%.*]] = load float, float* [[TMP43]], align 4, !tbaa !0
		; SSE-NEXT: [[TMP45:%.*]] = insertelement <4 x float> undef, float [[TMP30]], i32 0
		; SSE-NEXT: [[TMP46:%.*]] = insertelement <4 x float> [[TMP45]], float [[TMP34]], i32 1
		; SSE-NEXT: [[TMP47:%.*]] = insertelement <4 x float> [[TMP46]], float [[TMP38]], i32 2
		; SSE-NEXT: [[TMP48:%.*]] = insertelement <4 x float> [[TMP47]], float [[TMP42]], i32 3
		; SSE-NEXT: [[TMP49:%.*]] = insertelement <4 x float> undef, float [[TMP32]], i32 0
		; SSE-NEXT: [[TMP50:%.*]] = insertelement <4 x float> [[TMP49]], float [[TMP36]], i32 1
		; SSE-NEXT: [[TMP51:%.*]] = insertelement <4 x float> [[TMP50]], float [[TMP40]], i32 2
		; SSE-NEXT: [[TMP52:%.*]] = insertelement <4 x float> [[TMP51]], float [[TMP44]], i32 3
		; SSE-NEXT: [[TMP53:%.*]] = fdiv <4 x float> [[TMP48]], [[TMP52]]
		; SSE-NEXT: [[TMP54:%.*]] = bitcast float* [[TMP27]] to <4 x float>*
		; SSE-NEXT: store <4 x float> [[TMP53]], <4 x float>* [[TMP54]], align 4, !tbaa !0
; SSE-NEXT: ret void		; SSE-NEXT: ret void
;		;
; AVX-LABEL: @gather_load_div(		; AVX-LABEL: @gather_load_div(
; AVX-NEXT: [[TMP3:%.*]] = getelementptr inbounds float, float* [[TMP1:%.*]], i64 10		; AVX-NEXT: [[TMP3:%.*]] = load float, float* [[TMP1:%.*]], align 4, [[TBAA0:!tbaa !.*]]
; AVX-NEXT: [[TMP4:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 3		; AVX-NEXT: [[TMP4:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 4
; AVX-NEXT: [[TMP5:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 14		; AVX-NEXT: [[TMP5:%.*]] = load float, float* [[TMP4]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP6:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 17		; AVX-NEXT: [[TMP6:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 10
; AVX-NEXT: [[TMP7:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 8		; AVX-NEXT: [[TMP7:%.*]] = load float, float* [[TMP6]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP8:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 5		; AVX-NEXT: [[TMP8:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 13
; AVX-NEXT: [[TMP9:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 20		; AVX-NEXT: [[TMP9:%.*]] = load float, float* [[TMP8]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP10:%.*]] = insertelement <8 x float*> undef, float* [[TMP1]], i32 0		; AVX-NEXT: [[TMP10:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 3
; AVX-NEXT: [[TMP11:%.*]] = insertelement <8 x float*> [[TMP10]], float* [[TMP3]], i32 1		; AVX-NEXT: [[TMP11:%.*]] = load float, float* [[TMP10]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP12:%.*]] = insertelement <8 x float*> [[TMP11]], float* [[TMP4]], i32 2		; AVX-NEXT: [[TMP12:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 11
; AVX-NEXT: [[TMP13:%.*]] = insertelement <8 x float*> [[TMP12]], float* [[TMP5]], i32 3		; AVX-NEXT: [[TMP13:%.*]] = load float, float* [[TMP12]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP14:%.*]] = insertelement <8 x float*> [[TMP13]], float* [[TMP6]], i32 4		; AVX-NEXT: [[TMP14:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 14
; AVX-NEXT: [[TMP15:%.*]] = insertelement <8 x float*> [[TMP14]], float* [[TMP7]], i32 5		; AVX-NEXT: [[TMP15:%.*]] = load float, float* [[TMP14]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP16:%.*]] = insertelement <8 x float*> [[TMP15]], float* [[TMP8]], i32 6		; AVX-NEXT: [[TMP16:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 44
; AVX-NEXT: [[TMP17:%.*]] = insertelement <8 x float*> [[TMP16]], float* [[TMP9]], i32 7		; AVX-NEXT: [[TMP17:%.*]] = load float, float* [[TMP16]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP18:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP17]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), [[TBAA0]]		; AVX-NEXT: [[TMP18:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 17
; AVX-NEXT: [[TMP19:%.*]] = shufflevector <8 x float*> [[TMP10]], <8 x float*> undef, <8 x i32> zeroinitializer		; AVX-NEXT: [[TMP19:%.*]] = load float, float* [[TMP18]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP20:%.*]] = getelementptr float, <8 x float*> [[TMP19]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64 30, i64 27, i64 23>		; AVX-NEXT: [[TMP20:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 33
; AVX-NEXT: [[TMP21:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP20]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), [[TBAA0]]		; AVX-NEXT: [[TMP21:%.*]] = load float, float* [[TMP20]], align 4, [[TBAA0]]
; AVX-NEXT: [[TMP22:%.*]] = fdiv <8 x float> [[TMP18]], [[TMP21]]		; AVX-NEXT: [[TMP22:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 8
; AVX-NEXT: [[TMP23:%.*]] = bitcast float* [[TMP0:%.*]] to <8 x float>*		; AVX-NEXT: [[TMP23:%.*]] = load float, float* [[TMP22]], align 4, [[TBAA0]]
; AVX-NEXT: store <8 x float> [[TMP22]], <8 x float>* [[TMP23]], align 4, [[TBAA0]]		; AVX-NEXT: [[TMP24:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 30
		; AVX-NEXT: [[TMP25:%.*]] = load float, float* [[TMP24]], align 4, [[TBAA0]]
		; AVX-NEXT: [[TMP26:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 5
		; AVX-NEXT: [[TMP27:%.*]] = load float, float* [[TMP26]], align 4, [[TBAA0]]
		; AVX-NEXT: [[TMP28:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 27
		; AVX-NEXT: [[TMP29:%.*]] = load float, float* [[TMP28]], align 4, [[TBAA0]]
		; AVX-NEXT: [[TMP30:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 20
		; AVX-NEXT: [[TMP31:%.*]] = load float, float* [[TMP30]], align 4, [[TBAA0]]
		; AVX-NEXT: [[TMP32:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 23
		; AVX-NEXT: [[TMP33:%.*]] = load float, float* [[TMP32]], align 4, [[TBAA0]]
		; AVX-NEXT: [[TMP34:%.*]] = insertelement <8 x float> undef, float [[TMP3]], i32 0
		; AVX-NEXT: [[TMP35:%.*]] = insertelement <8 x float> [[TMP34]], float [[TMP7]], i32 1
		; AVX-NEXT: [[TMP36:%.*]] = insertelement <8 x float> [[TMP35]], float [[TMP11]], i32 2
		; AVX-NEXT: [[TMP37:%.*]] = insertelement <8 x float> [[TMP36]], float [[TMP15]], i32 3
		; AVX-NEXT: [[TMP38:%.*]] = insertelement <8 x float> [[TMP37]], float [[TMP19]], i32 4
		; AVX-NEXT: [[TMP39:%.*]] = insertelement <8 x float> [[TMP38]], float [[TMP23]], i32 5
		; AVX-NEXT: [[TMP40:%.*]] = insertelement <8 x float> [[TMP39]], float [[TMP27]], i32 6
		; AVX-NEXT: [[TMP41:%.*]] = insertelement <8 x float> [[TMP40]], float [[TMP31]], i32 7
		; AVX-NEXT: [[TMP42:%.*]] = insertelement <8 x float> undef, float [[TMP5]], i32 0
		; AVX-NEXT: [[TMP43:%.*]] = insertelement <8 x float> [[TMP42]], float [[TMP9]], i32 1
		; AVX-NEXT: [[TMP44:%.*]] = insertelement <8 x float> [[TMP43]], float [[TMP13]], i32 2
		; AVX-NEXT: [[TMP45:%.*]] = insertelement <8 x float> [[TMP44]], float [[TMP17]], i32 3
		; AVX-NEXT: [[TMP46:%.*]] = insertelement <8 x float> [[TMP45]], float [[TMP21]], i32 4
		; AVX-NEXT: [[TMP47:%.*]] = insertelement <8 x float> [[TMP46]], float [[TMP25]], i32 5
		; AVX-NEXT: [[TMP48:%.*]] = insertelement <8 x float> [[TMP47]], float [[TMP29]], i32 6
		; AVX-NEXT: [[TMP49:%.*]] = insertelement <8 x float> [[TMP48]], float [[TMP33]], i32 7
		; AVX-NEXT: [[TMP50:%.*]] = fdiv <8 x float> [[TMP41]], [[TMP49]]
		; AVX-NEXT: [[TMP51:%.*]] = bitcast float* [[TMP0:%.*]] to <8 x float>*
		; AVX-NEXT: store <8 x float> [[TMP50]], <8 x float>* [[TMP51]], align 4, [[TBAA0]]
; AVX-NEXT: ret void		; AVX-NEXT: ret void
;		;
; AVX2-LABEL: @gather_load_div(
; AVX2-NEXT: [[TMP3:%.*]] = getelementptr inbounds float, float* [[TMP1:%.*]], i64 10
; AVX2-NEXT: [[TMP4:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 3
; AVX2-NEXT: [[TMP5:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 14
; AVX2-NEXT: [[TMP6:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 17
; AVX2-NEXT: [[TMP7:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 8
; AVX2-NEXT: [[TMP8:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 5
; AVX2-NEXT: [[TMP9:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 20
; AVX2-NEXT: [[TMP10:%.*]] = insertelement <8 x float*> undef, float* [[TMP1]], i32 0
; AVX2-NEXT: [[TMP11:%.*]] = insertelement <8 x float*> [[TMP10]], float* [[TMP3]], i32 1
; AVX2-NEXT: [[TMP12:%.*]] = insertelement <8 x float*> [[TMP11]], float* [[TMP4]], i32 2
; AVX2-NEXT: [[TMP13:%.*]] = insertelement <8 x float*> [[TMP12]], float* [[TMP5]], i32 3
; AVX2-NEXT: [[TMP14:%.*]] = insertelement <8 x float*> [[TMP13]], float* [[TMP6]], i32 4
; AVX2-NEXT: [[TMP15:%.*]] = insertelement <8 x float*> [[TMP14]], float* [[TMP7]], i32 5
; AVX2-NEXT: [[TMP16:%.*]] = insertelement <8 x float*> [[TMP15]], float* [[TMP8]], i32 6
; AVX2-NEXT: [[TMP17:%.*]] = insertelement <8 x float*> [[TMP16]], float* [[TMP9]], i32 7
; AVX2-NEXT: [[TMP18:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP17]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), [[TBAA0]]
; AVX2-NEXT: [[TMP19:%.*]] = shufflevector <8 x float*> [[TMP10]], <8 x float*> undef, <8 x i32> zeroinitializer
; AVX2-NEXT: [[TMP20:%.*]] = getelementptr float, <8 x float*> [[TMP19]], <8 x i64> <i64 4, i64 13, i64 11, i64 44, i64 33, i64 30, i64 27, i64 23>
; AVX2-NEXT: [[TMP21:%.*]] = call <8 x float> @llvm.masked.gather.v8f32.v8p0f32(<8 x float*> [[TMP20]], i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> undef), [[TBAA0]]
; AVX2-NEXT: [[TMP22:%.*]] = fdiv <8 x float> [[TMP18]], [[TMP21]]
; AVX2-NEXT: [[TMP23:%.*]] = bitcast float* [[TMP0:%.*]] to <8 x float>*
; AVX2-NEXT: store <8 x float> [[TMP22]], <8 x float>* [[TMP23]], align 4, [[TBAA0]]
; AVX2-NEXT: ret void
;
; AVX512-LABEL: @gather_load_div(		; AVX512-LABEL: @gather_load_div(
; AVX512-NEXT: [[TMP3:%.*]] = getelementptr inbounds float, float* [[TMP1:%.*]], i64 10		; AVX512-NEXT: [[TMP3:%.*]] = getelementptr inbounds float, float* [[TMP1:%.*]], i64 10
; AVX512-NEXT: [[TMP4:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 3		; AVX512-NEXT: [[TMP4:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 3
; AVX512-NEXT: [[TMP5:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 14		; AVX512-NEXT: [[TMP5:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 14
; AVX512-NEXT: [[TMP6:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 17		; AVX512-NEXT: [[TMP6:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 17
; AVX512-NEXT: [[TMP7:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 8		; AVX512-NEXT: [[TMP7:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 8
; AVX512-NEXT: [[TMP8:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 5		; AVX512-NEXT: [[TMP8:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 5
; AVX512-NEXT: [[TMP9:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 20		; AVX512-NEXT: [[TMP9:%.*]] = getelementptr inbounds float, float* [[TMP1]], i64 20