This is an archive of the discontinued LLVM Phabricator instance.

Differential D111938

[TTI][X86] Add SSE2 sub-128bit vXi16/32 and v2i64 stride 2 interleaved load costs
Closed, Public

Authored by RKSimon on Oct 16 2021, 6:59 AM.

Details

Reviewers

lebedev.ri

Commits

rG6ec644e2157d: [TTI][X86] Add SSE2 sub-128bit vXi16/32 and v2i64 stride 2 interleaved load…

Summary

These cases use the same codegen as AVX2 (pshuflw/pshufd) for the sub-128bit vector deinterleaving, and unpcklqdq for v2i64.

It's going to take a while to add full interleaved cost coverage, but since these are the same for SSE2 -> AVX2 it should be an easy win.

Fixes PR47437
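
To illustrate what "stride 2 deinterleaving" of a sub-128-bit vector means, here is a minimal sketch for the 2 x 2i32 case. It is not the exact instruction sequence LLVM emits. It assumes one plausible use of pshufd (via the `_mm_shuffle_epi32` intrinsic) and falls back to scalar code when SSE2 is not available. The names `Deinterleaved2x2` and `deinterleaveStride2` are hypothetical and exist only for this example.

```cpp
#include <array>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical result type: the two de-interleaved 2 x i32 vectors.
struct Deinterleaved2x2 {
  std::array<std::int32_t, 2> even; // elements at indices 0 and 2
  std::array<std::int32_t, 2> odd;  // elements at indices 1 and 3
};

// Load 4 x i32 laid out as [a0, b0, a1, b1] and split them into
// [a0, a1] and [b0, b1]. `src` must point to at least 4 elements.
Deinterleaved2x2 deinterleaveStride2(const std::int32_t *src) {
  Deinterleaved2x2 out{};
#if defined(__SSE2__)
  __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src));
  // pshufd with lane order [0, 2, 1, 3]: the even elements end up in the
  // low 64 bits and the odd elements in the high 64 bits.
  __m128i s = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 1, 2, 0));
  alignas(16) std::int32_t tmp[4];
  _mm_store_si128(reinterpret_cast<__m128i *>(tmp), s);
  out.even = {tmp[0], tmp[1]};
  out.odd = {tmp[2], tmp[3]};
#else
  // Scalar fallback with the same semantics.
  out.even = {src[0], src[2]};
  out.odd = {src[1], src[3]};
#endif
  return out;
}
```

A single shuffle on an already-loaded register is the kind of cheap sequence that the small table costs in this patch are meant to reflect, in contrast to the much larger generic estimates the tests showed before.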

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

RKSimon created this revision. Oct 16 2021, 6:59 AM

Herald added subscribers: pengfei, hiraditya. Oct 16 2021, 6:59 AM

RKSimon requested review of this revision. Oct 16 2021, 6:59 AM

Herald added a project: Restricted Project. Oct 16 2021, 6:59 AM

lebedev.ri added inline comments. Oct 16 2021, 7:24 AM

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
5224	Looking at `llvm-project/llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-2.ll`, VF4 codegen is really different between SSE2 and AVX2.
5230	Looking at `llvm-project/llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-2.ll`, `@load_i32_stride2_vf4` also seems to match.

Harbormaster completed remote builds in B129186: Diff 380180. Oct 16 2021, 7:37 AM

RKSimon added inline comments. Oct 16 2021, 8:11 AM

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
5224	nice catch!
5230	every little helps :)

Address review comment

This revision is now accepted and ready to land. Oct 16 2021, 8:18 AM

This revision was landed with ongoing or failed builds. Oct 16 2021, 8:22 AM

Closed by commit rG6ec644e2157d: [TTI][X86] Add SSE2 sub-128bit vXi16/32 and v2i64 stride 2 interleaved load… (authored by RKSimon).

This revision was automatically updated to reflect the committed changes.

RKSimon added a commit: rG6ec644e2157d: [TTI][X86] Add SSE2 sub-128bit vXi16/32 and v2i64 stride 2 interleaved load….

Harbormaster completed remote builds in B129192: Diff 380186. Oct 16 2021, 8:56 AM

RKSimon mentioned this in rG85b87179f482: [TTI][X86] Add v8i16 -> 2 x v4i16 stride 2 interleaved load costs. Oct 16 2021, 9:32 AM

Revision Contents

Path                                                                          Size
llvm/lib/Target/X86/X86TargetTransformInfo.cpp                                18 lines
llvm/test/Analysis/CostModel/X86/interleaved-load-f32-stride-2.ll             8 lines
llvm/test/Analysis/CostModel/X86/interleaved-load-f64-stride-2.ll             4 lines
llvm/test/Analysis/CostModel/X86/interleaved-load-i16-stride-2.ll             4 lines
llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-2-indices-0u.ll  8 lines
llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-2.ll             8 lines
llvm/test/Analysis/CostModel/X86/interleaved-load-i64-stride-2.ll             4 lines
llvm/test/Transforms/LoopVectorize/X86/interleaving.ll                        123 lines
llvm/test/Transforms/LoopVectorize/X86/pr47437.ll                             368 lines

Diff 380187

llvm/lib/Target/X86/X86TargetTransformInfo.cpp

@@ ... @@ InstructionCost X86TTIImpl::getInterleavedMemoryOpCost(
   //
   static const CostTblEntry AVX2InterleavedLoadTbl[] = {
       {2, MVT::v2i8, 2}, // (load 4i8 and) deinterleave into 2 x 2i8
       {2, MVT::v4i8, 2}, // (load 8i8 and) deinterleave into 2 x 4i8
       {2, MVT::v8i8, 2}, // (load 16i8 and) deinterleave into 2 x 8i8
       {2, MVT::v16i8, 4}, // (load 32i8 and) deinterleave into 2 x 16i8
       {2, MVT::v32i8, 6}, // (load 64i8 and) deinterleave into 2 x 32i8

-      {2, MVT::v2i16, 2}, // (load 4i16 and) deinterleave into 2 x 2i16
       {2, MVT::v4i16, 2}, // (load 8i16 and) deinterleave into 2 x 4i16
lebedev.ri (inline comment): Looking at `llvm-project/llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-2.ll`, VF4 codegen is really different between SSE2 and AVX2.
RKSimon (inline comment): nice catch!
       {2, MVT::v8i16, 6}, // (load 16i16 and) deinterleave into 2 x 8i16
       {2, MVT::v16i16, 9}, // (load 32i16 and) deinterleave into 2 x 16i16
       {2, MVT::v32i16, 18}, // (load 64i16 and) deinterleave into 2 x 32i16

-      {2, MVT::v2i32, 2}, // (load 4i32 and) deinterleave into 2 x 2i32
-      {2, MVT::v4i32, 2}, // (load 8i32 and) deinterleave into 2 x 4i32
lebedev.ri (inline comment): Looking at `llvm-project/llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-2.ll`, `@load_i32_stride2_vf4` also seems to match.
RKSimon (inline comment): every little helps :)
       {2, MVT::v8i32, 4}, // (load 16i32 and) deinterleave into 2 x 8i32
       {2, MVT::v16i32, 8}, // (load 32i32 and) deinterleave into 2 x 16i32
       {2, MVT::v32i32, 16}, // (load 64i32 and) deinterleave into 2 x 32i32

-      {2, MVT::v2i64, 2}, // (load 4i64 and) deinterleave into 2 x 2i64
       {2, MVT::v4i64, 4}, // (load 8i64 and) deinterleave into 2 x 4i64
       {2, MVT::v8i64, 8}, // (load 16i64 and) deinterleave into 2 x 8i64
       {2, MVT::v16i64, 16}, // (load 32i64 and) deinterleave into 2 x 16i64

       {3, MVT::v2i8, 3}, // (load 6i8 and) deinterleave into 3 x 2i8
       {3, MVT::v4i8, 3}, // (load 12i8 and) deinterleave into 3 x 4i8
       {3, MVT::v8i8, 6}, // (load 24i8 and) deinterleave into 3 x 8i8
       {3, MVT::v16i8, 11}, // (load 48i8 and) deinterleave into 3 x 16i8
@@ ... @@ static const CostTblEntry AVX2InterleavedLoadTbl[] = {

       {6, MVT::v2i64, 6}, // (load 12i64 and) deinterleave into 6 x 2i64
       {6, MVT::v4i64, 18}, // (load 24i64 and) deinterleave into 6 x 4i64
       {6, MVT::v8i64, 36}, // (load 48i64 and) deinterleave into 6 x 8i64

       {8, MVT::v8i32, 40} // (load 64i32 and) deinterleave into 8 x 8i32
   };

+  static const CostTblEntry SSE2InterleavedLoadTbl[] = {
+      {2, MVT::v2i16, 2}, // (load 4i16 and) deinterleave into 2 x 2i16
+
+      {2, MVT::v2i32, 2}, // (load 4i32 and) deinterleave into 2 x 2i32
+      {2, MVT::v4i32, 2}, // (load 8i32 and) deinterleave into 2 x 4i32
+
+      {2, MVT::v2i64, 2}, // (load 4i64 and) deinterleave into 2 x 2i64
+  };
+
   static const CostTblEntry AVX2InterleavedStoreTbl[] = {
       {2, MVT::v2i8, 1}, // interleave 2 x 2i8 into 4i8 (and store)
       {2, MVT::v4i8, 1}, // interleave 2 x 4i8 into 8i8 (and store)
       {2, MVT::v8i8, 1}, // interleave 2 x 8i8 into 16i8 (and store)
       {2, MVT::v16i8, 3}, // interleave 2 x 16i8 into 32i8 (and store)
       {2, MVT::v32i8, 4}, // interleave 2 x 32i8 into 64i8 (and store)

       {2, MVT::v2i16, 1}, // interleave 2 x 2i16 into 4i16 (and store)
@@ ... @@ InstructionCost X86TTIImpl::getInterleavedMemoryOpCost(

   if (Opcode == Instruction::Load) {
     // FIXME: if we have a partially-interleaved groups, with gaps,
     //        should we discount the not-demanded indicies?
     if (ST->hasAVX2())
       if (const auto *Entry = CostTableLookup(AVX2InterleavedLoadTbl, Factor,
                                               ETy.getSimpleVT()))
         return MemOpCosts + Entry->Cost;
+
+    if (ST->hasSSE2())
+      if (const auto *Entry = CostTableLookup(SSE2InterleavedLoadTbl, Factor,
+                                              ETy.getSimpleVT()))
+        return MemOpCosts + Entry->Cost;
   } else {
     assert(Opcode == Instruction::Store &&
            "Expected Store Instruction at this point");
     assert((!Indices.size() || Indices.size() == Factor) &&
            "Interleaved store only supports fully-interleaved groups.");
     if (ST->hasAVX2())
       if (const auto *Entry = CostTableLookup(AVX2InterleavedStoreTbl, Factor,
                                               ETy.getSimpleVT()))
         return MemOpCosts + Entry->Cost;
   }

   return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
                                            Alignment, AddressSpace, CostKind,
                                            UseMaskForCond, UseMaskForGaps);
 }
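
The lookup order in the hunk above can be modelled with a small self-contained sketch. It does not use LLVM's real `CostTblEntry`, `MVT`, or `CostTableLookup`. The types `VT`, `Entry`, and `Subtarget`, the helper `lookup`, and the function `interleavedLoadCost` are hypothetical stand-ins, and the tables hold only a few of the entries visible in the diff. The point is to show why entries in the SSE2 table also apply to AVX2 targets: an AVX2 subtarget also has SSE2, so a miss in the AVX2 table falls through to the SSE2 table.

```cpp
#include <cstddef>

// Hypothetical stand-in for the handful of MVT values used below.
enum class VT { v2i16, v4i16, v2i32, v4i32, v2i64, v4i64 };

// Hypothetical stand-in for a cost table row: {Factor, Type, Cost}.
struct Entry {
  unsigned Factor;
  VT Type;
  int Cost;
};

// A subset of the entries shown in the diff, for illustration only.
constexpr Entry AVX2LoadTbl[] = {
    {2, VT::v4i16, 2},
    {2, VT::v4i64, 4},
};
constexpr Entry SSE2LoadTbl[] = {
    {2, VT::v2i16, 2},
    {2, VT::v2i32, 2},
    {2, VT::v4i32, 2},
    {2, VT::v2i64, 2},
};

// Linear search for a matching (Factor, Type) row.
template <std::size_t N>
const Entry *lookup(const Entry (&Tbl)[N], unsigned Factor, VT Type) {
  for (const Entry &E : Tbl)
    if (E.Factor == Factor && E.Type == Type)
      return &E;
  return nullptr;
}

struct Subtarget {
  bool HasSSE2;
  bool HasAVX2;
};

// Returns MemOpCosts + shuffle cost, or -1 when neither table has a row
// (the real code would defer to the base implementation in that case).
int interleavedLoadCost(const Subtarget &ST, unsigned Factor, VT Type,
                        int MemOpCosts) {
  if (ST.HasAVX2)
    if (const Entry *E = lookup(AVX2LoadTbl, Factor, Type))
      return MemOpCosts + E->Cost;
  if (ST.HasSSE2)
    if (const Entry *E = lookup(SSE2LoadTbl, Factor, Type))
      return MemOpCosts + E->Cost;
  return -1;
}
```

With an assumed memory cost of 1, `interleavedLoadCost({true, false}, 2, VT::v2i32, 1)` and `interleavedLoadCost({true, true}, 2, VT::v2i32, 1)` both give 3, while a v4i16 query only hits on the AVX2 subtarget. This is consistent with the review comment that VF4 i16 codegen differs between SSE2 and AVX2, and with the AVX2 check lines in the tests staying unchanged.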

llvm/test/Analysis/CostModel/X86/interleaved-load-f32-stride-2.ll

 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+sse2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,SSE2
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX1
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX2
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx512bw,+avx512vl --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX512
 ; REQUIRES: asserts

 target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-unknown-linux-gnu"

 @A = global [1024 x float] zeroinitializer, align 128
 @B = global [1024 x i8] zeroinitializer, align 128

 ; CHECK: LV: Checking a loop in "test"
 ;
 ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load float, float* %in0, align 4
-; SSE2: LV: Found an estimated cost of 6 for VF 2 For instruction: %v0 = load float, float* %in0, align 4
-; SSE2: LV: Found an estimated cost of 14 for VF 4 For instruction: %v0 = load float, float* %in0, align 4
+; SSE2: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load float, float* %in0, align 4
+; SSE2: LV: Found an estimated cost of 4 for VF 4 For instruction: %v0 = load float, float* %in0, align 4
 ; SSE2: LV: Found an estimated cost of 28 for VF 8 For instruction: %v0 = load float, float* %in0, align 4
 ; SSE2: LV: Found an estimated cost of 56 for VF 16 For instruction: %v0 = load float, float* %in0, align 4
 ;
 ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load float, float* %in0, align 4
-; AVX1: LV: Found an estimated cost of 6 for VF 2 For instruction: %v0 = load float, float* %in0, align 4
-; AVX1: LV: Found an estimated cost of 17 for VF 4 For instruction: %v0 = load float, float* %in0, align 4
+; AVX1: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load float, float* %in0, align 4
+; AVX1: LV: Found an estimated cost of 3 for VF 4 For instruction: %v0 = load float, float* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 38 for VF 8 For instruction: %v0 = load float, float* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 76 for VF 16 For instruction: %v0 = load float, float* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 152 for VF 32 For instruction: %v0 = load float, float* %in0, align 4
 ;;
 ; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load float, float* %in0, align 4
 ; AVX2: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load float, float* %in0, align 4
 ; AVX2: LV: Found an estimated cost of 3 for VF 4 For instruction: %v0 = load float, float* %in0, align 4
 ; AVX2: LV: Found an estimated cost of 6 for VF 8 For instruction: %v0 = load float, float* %in0, align 4
@@ ... @@

llvm/test/Analysis/CostModel/X86/interleaved-load-f64-stride-2.ll

 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+sse2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,SSE2
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX1
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX2
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx512bw,+avx512vl --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX512
 ; REQUIRES: asserts

 target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-unknown-linux-gnu"

 @A = global [1024 x double] zeroinitializer, align 128
 @B = global [1024 x i8] zeroinitializer, align 128

 ; CHECK: LV: Checking a loop in "test"
 ;
 ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load double, double* %in0, align 8
-; SSE2: LV: Found an estimated cost of 6 for VF 2 For instruction: %v0 = load double, double* %in0, align 8
+; SSE2: LV: Found an estimated cost of 4 for VF 2 For instruction: %v0 = load double, double* %in0, align 8
 ; SSE2: LV: Found an estimated cost of 12 for VF 4 For instruction: %v0 = load double, double* %in0, align 8
 ; SSE2: LV: Found an estimated cost of 24 for VF 8 For instruction: %v0 = load double, double* %in0, align 8
 ; SSE2: LV: Found an estimated cost of 48 for VF 16 For instruction: %v0 = load double, double* %in0, align 8
 ;
 ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load double, double* %in0, align 8
-; AVX1: LV: Found an estimated cost of 7 for VF 2 For instruction: %v0 = load double, double* %in0, align 8
+; AVX1: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load double, double* %in0, align 8
 ; AVX1: LV: Found an estimated cost of 16 for VF 4 For instruction: %v0 = load double, double* %in0, align 8
 ; AVX1: LV: Found an estimated cost of 32 for VF 8 For instruction: %v0 = load double, double* %in0, align 8
 ; AVX1: LV: Found an estimated cost of 64 for VF 16 For instruction: %v0 = load double, double* %in0, align 8
 ; AVX1: LV: Found an estimated cost of 128 for VF 32 For instruction: %v0 = load double, double* %in0, align 8
 ;;
 ; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load double, double* %in0, align 8
 ; AVX2: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load double, double* %in0, align 8
 ; AVX2: LV: Found an estimated cost of 6 for VF 4 For instruction: %v0 = load double, double* %in0, align 8
@@ ... @@

llvm/test/Analysis/CostModel/X86/interleaved-load-i16-stride-2.ll

 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+sse2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,SSE2
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX1
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX2
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx512bw,+avx512vl --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX512
 ; REQUIRES: asserts

 target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-unknown-linux-gnu"

 @A = global [1024 x i16] zeroinitializer, align 128
 @B = global [1024 x i8] zeroinitializer, align 128

 ; CHECK: LV: Checking a loop in "test"
 ;
 ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i16, i16* %in0, align 2
-; SSE2: LV: Found an estimated cost of 9 for VF 2 For instruction: %v0 = load i16, i16* %in0, align 2
+; SSE2: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i16, i16* %in0, align 2
 ; SSE2: LV: Found an estimated cost of 17 for VF 4 For instruction: %v0 = load i16, i16* %in0, align 2
 ; SSE2: LV: Found an estimated cost of 34 for VF 8 For instruction: %v0 = load i16, i16* %in0, align 2
 ; SSE2: LV: Found an estimated cost of 68 for VF 16 For instruction: %v0 = load i16, i16* %in0, align 2
 ;
 ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i16, i16* %in0, align 2
-; AVX1: LV: Found an estimated cost of 9 for VF 2 For instruction: %v0 = load i16, i16* %in0, align 2
+; AVX1: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i16, i16* %in0, align 2
 ; AVX1: LV: Found an estimated cost of 17 for VF 4 For instruction: %v0 = load i16, i16* %in0, align 2
 ; AVX1: LV: Found an estimated cost of 41 for VF 8 For instruction: %v0 = load i16, i16* %in0, align 2
 ; AVX1: LV: Found an estimated cost of 86 for VF 16 For instruction: %v0 = load i16, i16* %in0, align 2
 ; AVX1: LV: Found an estimated cost of 172 for VF 32 For instruction: %v0 = load i16, i16* %in0, align 2
 ;
 ; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i16, i16* %in0, align 2
 ; AVX2: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i16, i16* %in0, align 2
 ; AVX2: LV: Found an estimated cost of 3 for VF 4 For instruction: %v0 = load i16, i16* %in0, align 2
@@ ... @@

llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-2-indices-0u.ll

 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+sse2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,SSE2
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX1
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX2
 ; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx512bw,+avx512vl --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX512
 ; REQUIRES: asserts

 target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-unknown-linux-gnu"

 @A = global [1024 x i32] zeroinitializer, align 128
 @B = global [1024 x i8] zeroinitializer, align 128

 ; CHECK: LV: Checking a loop in "test"
 ;
 ; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
-; SSE2: LV: Found an estimated cost of 7 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
-; SSE2: LV: Found an estimated cost of 15 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
+; SSE2: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
+; SSE2: LV: Found an estimated cost of 4 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
 ; SSE2: LV: Found an estimated cost of 30 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
 ; SSE2: LV: Found an estimated cost of 60 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ;
 ; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX1: LV: Found an estimated cost of 5 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
-; AVX1: LV: Found an estimated cost of 11 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX1: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
+; AVX1: LV: Found an estimated cost of 3 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 24 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 48 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX1: LV: Found an estimated cost of 96 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
 ;
 ; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX2: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX2: LV: Found an estimated cost of 3 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
 ; AVX2: LV: Found an estimated cost of 6 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
@@ ... @@

llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-2.ll

	; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+sse2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,SSE2			; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+sse2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,SSE2
	; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX1			; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX1
	; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX2			; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX2
	; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx512bw,+avx512vl --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX512			; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx512bw,+avx512vl --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX512
	; REQUIRES: asserts			; REQUIRES: asserts

	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-unknown-linux-gnu"			target triple = "x86_64-unknown-linux-gnu"

	@A = global [1024 x i32] zeroinitializer, align 128			@A = global [1024 x i32] zeroinitializer, align 128
	@B = global [1024 x i8] zeroinitializer, align 128			@B = global [1024 x i8] zeroinitializer, align 128

	; CHECK: LV: Checking a loop in "test"			; CHECK: LV: Checking a loop in "test"
	;			;
	; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4			; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
	; SSE2: LV: Found an estimated cost of 14 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4			; SSE2: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
	; SSE2: LV: Found an estimated cost of 30 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4			; SSE2: LV: Found an estimated cost of 4 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
	; SSE2: LV: Found an estimated cost of 60 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4			; SSE2: LV: Found an estimated cost of 60 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
	; SSE2: LV: Found an estimated cost of 120 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4			; SSE2: LV: Found an estimated cost of 120 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
	;			;
	; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4			; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
	; AVX1: LV: Found an estimated cost of 9 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4			; AVX1: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
	; AVX1: LV: Found an estimated cost of 21 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4			; AVX1: LV: Found an estimated cost of 3 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
	; AVX1: LV: Found an estimated cost of 46 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4			; AVX1: LV: Found an estimated cost of 46 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4
	; AVX1: LV: Found an estimated cost of 92 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4			; AVX1: LV: Found an estimated cost of 92 for VF 16 For instruction: %v0 = load i32, i32* %in0, align 4
	; AVX1: LV: Found an estimated cost of 184 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4			; AVX1: LV: Found an estimated cost of 184 for VF 32 For instruction: %v0 = load i32, i32* %in0, align 4
	;			;
	; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4			; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i32, i32* %in0, align 4
	; AVX2: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4			; AVX2: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i32, i32* %in0, align 4
	; AVX2: LV: Found an estimated cost of 3 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4			; AVX2: LV: Found an estimated cost of 3 for VF 4 For instruction: %v0 = load i32, i32* %in0, align 4
	; AVX2: LV: Found an estimated cost of 6 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4			; AVX2: LV: Found an estimated cost of 6 for VF 8 For instruction: %v0 = load i32, i32* %in0, align 4

llvm/test/Analysis/CostModel/X86/interleaved-load-i64-stride-2.ll

	; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+sse2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,SSE2			; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+sse2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,SSE2
	; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX1			; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX1
	; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX2			; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx2 --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX2
	; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx512bw,+avx512vl --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX512			; RUN: opt -loop-vectorize -vectorizer-maximize-bandwidth -S -mattr=+avx512bw,+avx512vl --debug-only=loop-vectorize < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,AVX512
	; REQUIRES: asserts			; REQUIRES: asserts

	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-unknown-linux-gnu"			target triple = "x86_64-unknown-linux-gnu"

	@A = global [1024 x i64] zeroinitializer, align 128			@A = global [1024 x i64] zeroinitializer, align 128
	@B = global [1024 x i8] zeroinitializer, align 128			@B = global [1024 x i8] zeroinitializer, align 128

	; CHECK: LV: Checking a loop in "test"			; CHECK: LV: Checking a loop in "test"
	;			;
	; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i64, i64* %in0, align 8			; SSE2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i64, i64* %in0, align 8
	; SSE2: LV: Found an estimated cost of 14 for VF 2 For instruction: %v0 = load i64, i64* %in0, align 8			; SSE2: LV: Found an estimated cost of 4 for VF 2 For instruction: %v0 = load i64, i64* %in0, align 8
	; SSE2: LV: Found an estimated cost of 28 for VF 4 For instruction: %v0 = load i64, i64* %in0, align 8			; SSE2: LV: Found an estimated cost of 28 for VF 4 For instruction: %v0 = load i64, i64* %in0, align 8
	; SSE2: LV: Found an estimated cost of 56 for VF 8 For instruction: %v0 = load i64, i64* %in0, align 8			; SSE2: LV: Found an estimated cost of 56 for VF 8 For instruction: %v0 = load i64, i64* %in0, align 8
	; SSE2: LV: Found an estimated cost of 112 for VF 16 For instruction: %v0 = load i64, i64* %in0, align 8			; SSE2: LV: Found an estimated cost of 112 for VF 16 For instruction: %v0 = load i64, i64* %in0, align 8
	;			;
	; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i64, i64* %in0, align 8			; AVX1: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i64, i64* %in0, align 8
	; AVX1: LV: Found an estimated cost of 11 for VF 2 For instruction: %v0 = load i64, i64* %in0, align 8			; AVX1: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i64, i64* %in0, align 8
	; AVX1: LV: Found an estimated cost of 26 for VF 4 For instruction: %v0 = load i64, i64* %in0, align 8			; AVX1: LV: Found an estimated cost of 26 for VF 4 For instruction: %v0 = load i64, i64* %in0, align 8
	; AVX1: LV: Found an estimated cost of 52 for VF 8 For instruction: %v0 = load i64, i64* %in0, align 8			; AVX1: LV: Found an estimated cost of 52 for VF 8 For instruction: %v0 = load i64, i64* %in0, align 8
	; AVX1: LV: Found an estimated cost of 104 for VF 16 For instruction: %v0 = load i64, i64* %in0, align 8			; AVX1: LV: Found an estimated cost of 104 for VF 16 For instruction: %v0 = load i64, i64* %in0, align 8
	; AVX1: LV: Found an estimated cost of 208 for VF 32 For instruction: %v0 = load i64, i64* %in0, align 8			; AVX1: LV: Found an estimated cost of 208 for VF 32 For instruction: %v0 = load i64, i64* %in0, align 8
	;			;
	; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i64, i64* %in0, align 8			; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction: %v0 = load i64, i64* %in0, align 8
	; AVX2: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i64, i64* %in0, align 8			; AVX2: LV: Found an estimated cost of 3 for VF 2 For instruction: %v0 = load i64, i64* %in0, align 8
	; AVX2: LV: Found an estimated cost of 6 for VF 4 For instruction: %v0 = load i64, i64* %in0, align 8			; AVX2: LV: Found an estimated cost of 6 for VF 4 For instruction: %v0 = load i64, i64* %in0, align 8

llvm/test/Transforms/LoopVectorize/X86/interleaving.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -S -mtriple=x86_64-pc_linux -loop-vectorize -instcombine < %s | FileCheck %s --check-prefix=SSE			; RUN: opt -S -mtriple=x86_64-pc_linux -loop-vectorize -instcombine < %s | FileCheck %s --check-prefix=SSE
	; RUN: opt -S -mtriple=x86_64-pc_linux -loop-vectorize -instcombine -mcpu=sandybridge < %s | FileCheck %s --check-prefix=AVX1			; RUN: opt -S -mtriple=x86_64-pc_linux -loop-vectorize -instcombine -mcpu=sandybridge < %s | FileCheck %s --check-prefix=AVX1
	; RUN: opt -S -mtriple=x86_64-pc_linux -loop-vectorize -instcombine -mcpu=haswell < %s | FileCheck %s --check-prefix=AVX2			; RUN: opt -S -mtriple=x86_64-pc_linux -loop-vectorize -instcombine -mcpu=haswell < %s | FileCheck %s --check-prefix=AVX2
	; RUN: opt -S -mtriple=x86_64-pc_linux -loop-vectorize -instcombine -mcpu=slm < %s | FileCheck %s --check-prefix=SSE			; RUN: opt -S -mtriple=x86_64-pc_linux -loop-vectorize -instcombine -mcpu=slm < %s | FileCheck %s --check-prefix=SSE
	; RUN: opt -S -mtriple=x86_64-pc_linux -loop-vectorize -instcombine -mcpu=atom < %s | FileCheck %s --check-prefix=SSE			; RUN: opt -S -mtriple=x86_64-pc_linux -loop-vectorize -instcombine -mcpu=atom < %s | FileCheck %s --check-prefix=ATOM

	define void @foo(i32* noalias nocapture %a, i32* noalias nocapture readonly %b) {			define void @foo(i32* noalias nocapture %a, i32* noalias nocapture readonly %b) {
	; SSE-LABEL: @foo(			; SSE-LABEL: @foo(
	; SSE-NEXT: entry:			; SSE-NEXT: entry:
				; SSE-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; SSE: vector.ph:
				; SSE-NEXT: br label [[VECTOR_BODY:%.*]]
				; SSE: vector.body:
				; SSE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; SSE-NEXT: [[TMP0:%.*]] = shl nsw i64 [[INDEX]], 1
				; SSE-NEXT: [[TMP1:%.*]] = shl i64 [[INDEX]], 1
				; SSE-NEXT: [[TMP2:%.*]] = or i64 [[TMP1]], 8
				; SSE-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 [[TMP0]]
				; SSE-NEXT: [[TMP4:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[TMP2]]
				; SSE-NEXT: [[TMP5:%.*]] = bitcast i32* [[TMP3]] to <8 x i32>*
				; SSE-NEXT: [[TMP6:%.*]] = bitcast i32* [[TMP4]] to <8 x i32>*
				; SSE-NEXT: [[WIDE_VEC:%.*]] = load <8 x i32>, <8 x i32>* [[TMP5]], align 4
				; SSE-NEXT: [[WIDE_VEC1:%.*]] = load <8 x i32>, <8 x i32>* [[TMP6]], align 4
				; SSE-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <8 x i32> [[WIDE_VEC]], <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				; SSE-NEXT: [[STRIDED_VEC2:%.*]] = shufflevector <8 x i32> [[WIDE_VEC1]], <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				; SSE-NEXT: [[STRIDED_VEC3:%.*]] = shufflevector <8 x i32> [[WIDE_VEC]], <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				; SSE-NEXT: [[STRIDED_VEC4:%.*]] = shufflevector <8 x i32> [[WIDE_VEC1]], <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				; SSE-NEXT: [[TMP7:%.*]] = add nsw <4 x i32> [[STRIDED_VEC3]], [[STRIDED_VEC]]
				; SSE-NEXT: [[TMP8:%.*]] = add nsw <4 x i32> [[STRIDED_VEC4]], [[STRIDED_VEC2]]
				; SSE-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 [[INDEX]]
				; SSE-NEXT: [[TMP10:%.*]] = bitcast i32* [[TMP9]] to <4 x i32>*
				; SSE-NEXT: store <4 x i32> [[TMP7]], <4 x i32>* [[TMP10]], align 4
				; SSE-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, i32* [[TMP9]], i64 4
				; SSE-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <4 x i32>*
				; SSE-NEXT: store <4 x i32> [[TMP8]], <4 x i32>* [[TMP12]], align 4
				; SSE-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
				; SSE-NEXT: [[TMP13:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1024
				; SSE-NEXT: br i1 [[TMP13]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
				; SSE: middle.block:
				; SSE-NEXT: br i1 true, label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]
				; SSE: scalar.ph:
	; SSE-NEXT: br label [[FOR_BODY:%.*]]			; SSE-NEXT: br label [[FOR_BODY:%.*]]
	; SSE: for.cond.cleanup:			; SSE: for.cond.cleanup:
	; SSE-NEXT: ret void			; SSE-NEXT: ret void
	; SSE: for.body:			; SSE: for.body:
	; SSE-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]			; SSE-NEXT: br i1 undef, label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]
	; SSE-NEXT: [[TMP0:%.*]] = shl nuw nsw i64 [[INDVARS_IV]], 1
	; SSE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 [[TMP0]]
	; SSE-NEXT: [[TMP1:%.*]] = load i32, i32* [[ARRAYIDX]], align 4
	; SSE-NEXT: [[TMP2:%.*]] = or i64 [[TMP0]], 1
	; SSE-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[TMP2]]
	; SSE-NEXT: [[TMP3:%.*]] = load i32, i32* [[ARRAYIDX3]], align 4
	; SSE-NEXT: [[ADD4:%.*]] = add nsw i32 [[TMP3]], [[TMP1]]
	; SSE-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 [[INDVARS_IV]]
	; SSE-NEXT: store i32 [[ADD4]], i32* [[ARRAYIDX6]], align 4
	; SSE-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; SSE-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 1024
	; SSE-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
	;			;
	; AVX1-LABEL: @foo(			; AVX1-LABEL: @foo(
	; AVX1-NEXT: entry:			; AVX1-NEXT: entry:
	; AVX1-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; AVX1-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; AVX1: vector.ph:			; AVX1: vector.ph:
	; AVX1-NEXT: br label [[VECTOR_BODY:%.*]]			; AVX1-NEXT: br label [[VECTOR_BODY:%.*]]
	; AVX1: vector.body:			; AVX1: vector.body:
	; AVX1-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; AVX1-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; AVX1-NEXT: [[TMP0:%.*]] = shl nsw i64 [[INDEX]], 1			; AVX1-NEXT: [[TMP0:%.*]] = shl nsw i64 [[INDEX]], 1
	; AVX1-NEXT: [[TMP1:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 [[TMP0]]			; AVX1-NEXT: [[TMP1:%.*]] = shl i64 [[INDEX]], 1
	; AVX1-NEXT: [[TMP2:%.*]] = bitcast i32* [[TMP1]] to <8 x i32>*			; AVX1-NEXT: [[TMP2:%.*]] = or i64 [[TMP1]], 8
	; AVX1-NEXT: [[WIDE_VEC:%.*]] = load <8 x i32>, <8 x i32>* [[TMP2]], align 4			; AVX1-NEXT: [[TMP3:%.*]] = shl i64 [[INDEX]], 1
				; AVX1-NEXT: [[TMP4:%.*]] = or i64 [[TMP3]], 16
				; AVX1-NEXT: [[TMP5:%.*]] = shl i64 [[INDEX]], 1
				; AVX1-NEXT: [[TMP6:%.*]] = or i64 [[TMP5]], 24
				; AVX1-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 [[TMP0]]
				; AVX1-NEXT: [[TMP8:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[TMP2]]
				; AVX1-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[TMP4]]
				; AVX1-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[TMP6]]
				; AVX1-NEXT: [[TMP11:%.*]] = bitcast i32* [[TMP7]] to <8 x i32>*
				; AVX1-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP8]] to <8 x i32>*
				; AVX1-NEXT: [[TMP13:%.*]] = bitcast i32* [[TMP9]] to <8 x i32>*
				; AVX1-NEXT: [[TMP14:%.*]] = bitcast i32* [[TMP10]] to <8 x i32>*
				; AVX1-NEXT: [[WIDE_VEC:%.*]] = load <8 x i32>, <8 x i32>* [[TMP11]], align 4
				; AVX1-NEXT: [[WIDE_VEC1:%.*]] = load <8 x i32>, <8 x i32>* [[TMP12]], align 4
				; AVX1-NEXT: [[WIDE_VEC2:%.*]] = load <8 x i32>, <8 x i32>* [[TMP13]], align 4
				; AVX1-NEXT: [[WIDE_VEC3:%.*]] = load <8 x i32>, <8 x i32>* [[TMP14]], align 4
	; AVX1-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <8 x i32> [[WIDE_VEC]], <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>			; AVX1-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <8 x i32> [[WIDE_VEC]], <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
	; AVX1-NEXT: [[STRIDED_VEC1:%.*]] = shufflevector <8 x i32> [[WIDE_VEC]], <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>			; AVX1-NEXT: [[STRIDED_VEC4:%.*]] = shufflevector <8 x i32> [[WIDE_VEC1]], <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
	; AVX1-NEXT: [[TMP3:%.*]] = add nsw <4 x i32> [[STRIDED_VEC1]], [[STRIDED_VEC]]			; AVX1-NEXT: [[STRIDED_VEC5:%.*]] = shufflevector <8 x i32> [[WIDE_VEC2]], <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
	; AVX1-NEXT: [[TMP4:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 [[INDEX]]			; AVX1-NEXT: [[STRIDED_VEC6:%.*]] = shufflevector <8 x i32> [[WIDE_VEC3]], <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
	; AVX1-NEXT: [[TMP5:%.*]] = bitcast i32* [[TMP4]] to <4 x i32>*			; AVX1-NEXT: [[STRIDED_VEC7:%.*]] = shufflevector <8 x i32> [[WIDE_VEC]], <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
	; AVX1-NEXT: store <4 x i32> [[TMP3]], <4 x i32>* [[TMP5]], align 4			; AVX1-NEXT: [[STRIDED_VEC8:%.*]] = shufflevector <8 x i32> [[WIDE_VEC1]], <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
	; AVX1-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4			; AVX1-NEXT: [[STRIDED_VEC9:%.*]] = shufflevector <8 x i32> [[WIDE_VEC2]], <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
	; AVX1-NEXT: [[TMP6:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1024			; AVX1-NEXT: [[STRIDED_VEC10:%.*]] = shufflevector <8 x i32> [[WIDE_VEC3]], <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
	; AVX1-NEXT: br i1 [[TMP6]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]			; AVX1-NEXT: [[TMP15:%.*]] = add nsw <4 x i32> [[STRIDED_VEC7]], [[STRIDED_VEC]]
				; AVX1-NEXT: [[TMP16:%.*]] = add nsw <4 x i32> [[STRIDED_VEC8]], [[STRIDED_VEC4]]
				; AVX1-NEXT: [[TMP17:%.*]] = add nsw <4 x i32> [[STRIDED_VEC9]], [[STRIDED_VEC5]]
				; AVX1-NEXT: [[TMP18:%.*]] = add nsw <4 x i32> [[STRIDED_VEC10]], [[STRIDED_VEC6]]
				; AVX1-NEXT: [[TMP19:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 [[INDEX]]
				; AVX1-NEXT: [[TMP20:%.*]] = bitcast i32* [[TMP19]] to <4 x i32>*
				; AVX1-NEXT: store <4 x i32> [[TMP15]], <4 x i32>* [[TMP20]], align 4
				; AVX1-NEXT: [[TMP21:%.*]] = getelementptr inbounds i32, i32* [[TMP19]], i64 4
				; AVX1-NEXT: [[TMP22:%.*]] = bitcast i32* [[TMP21]] to <4 x i32>*
				; AVX1-NEXT: store <4 x i32> [[TMP16]], <4 x i32>* [[TMP22]], align 4
				; AVX1-NEXT: [[TMP23:%.*]] = getelementptr inbounds i32, i32* [[TMP19]], i64 8
				; AVX1-NEXT: [[TMP24:%.*]] = bitcast i32* [[TMP23]] to <4 x i32>*
				; AVX1-NEXT: store <4 x i32> [[TMP17]], <4 x i32>* [[TMP24]], align 4
				; AVX1-NEXT: [[TMP25:%.*]] = getelementptr inbounds i32, i32* [[TMP19]], i64 12
				; AVX1-NEXT: [[TMP26:%.*]] = bitcast i32* [[TMP25]] to <4 x i32>*
				; AVX1-NEXT: store <4 x i32> [[TMP18]], <4 x i32>* [[TMP26]], align 4
				; AVX1-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
				; AVX1-NEXT: [[TMP27:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1024
				; AVX1-NEXT: br i1 [[TMP27]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
	; AVX1: middle.block:			; AVX1: middle.block:
	; AVX1-NEXT: br i1 true, label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]			; AVX1-NEXT: br i1 true, label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]
	; AVX1: scalar.ph:			; AVX1: scalar.ph:
	; AVX1-NEXT: br label [[FOR_BODY:%.*]]			; AVX1-NEXT: br label [[FOR_BODY:%.*]]
	; AVX1: for.cond.cleanup:			; AVX1: for.cond.cleanup:
	; AVX1-NEXT: ret void			; AVX1-NEXT: ret void
	; AVX1: for.body:			; AVX1: for.body:
	; AVX1-NEXT: br i1 undef, label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]			; AVX1-NEXT: br i1 undef, label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]
	[... 55 lines not shown ...]
	; AVX2-NEXT: br i1 true, label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]			; AVX2-NEXT: br i1 true, label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]
	; AVX2: scalar.ph:			; AVX2: scalar.ph:
	; AVX2-NEXT: br label [[FOR_BODY:%.*]]			; AVX2-NEXT: br label [[FOR_BODY:%.*]]
	; AVX2: for.cond.cleanup:			; AVX2: for.cond.cleanup:
	; AVX2-NEXT: ret void			; AVX2-NEXT: ret void
	; AVX2: for.body:			; AVX2: for.body:
	; AVX2-NEXT: br i1 undef, label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]			; AVX2-NEXT: br i1 undef, label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]
	;			;
				; ATOM-LABEL: @foo(
				; ATOM-NEXT: entry:
				; ATOM-NEXT: br label [[FOR_BODY:%.*]]
				; ATOM: for.cond.cleanup:
				; ATOM-NEXT: ret void
				; ATOM: for.body:
				; ATOM-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
				; ATOM-NEXT: [[TMP0:%.*]] = shl nuw nsw i64 [[INDVARS_IV]], 1
				; ATOM-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 [[TMP0]]
				; ATOM-NEXT: [[TMP1:%.*]] = load i32, i32* [[ARRAYIDX]], align 4
				; ATOM-NEXT: [[TMP2:%.*]] = or i64 [[TMP0]], 1
				; ATOM-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[TMP2]]
				; ATOM-NEXT: [[TMP3:%.*]] = load i32, i32* [[ARRAYIDX3]], align 4
				; ATOM-NEXT: [[ADD4:%.*]] = add nsw i32 [[TMP3]], [[TMP1]]
				; ATOM-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 [[INDVARS_IV]]
				; ATOM-NEXT: store i32 [[ADD4]], i32* [[ARRAYIDX6]], align 4
				; ATOM-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
				; ATOM-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 1024
				; ATOM-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
				;
	entry:			entry:
	br label %for.body			br label %for.body

	for.cond.cleanup: ; preds = %for.body			for.cond.cleanup: ; preds = %for.body
	ret void			ret void

	for.body: ; preds = %for.body, %entry			for.body: ; preds = %for.body, %entry
	%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]			%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]

llvm/test/Transforms/LoopVectorize/X86/pr47437.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -S -loop-vectorize -mtriple=x86_64-- -mattr=+sse2 | FileCheck %s --check-prefix=SSE2			; RUN: opt < %s -S -loop-vectorize -mtriple=x86_64-- -mattr=+sse2 | FileCheck %s --check-prefix=SSE2
	; RUN: opt < %s -S -loop-vectorize -mtriple=x86_64-- -mattr=+sse4.1 | FileCheck %s --check-prefix=SSE41			; RUN: opt < %s -S -loop-vectorize -mtriple=x86_64-- -mattr=+sse4.1 | FileCheck %s --check-prefix=SSE41
	; RUN: opt < %s -S -loop-vectorize -mtriple=x86_64-- -mattr=+avx | FileCheck %s --check-prefix=AVX1			; RUN: opt < %s -S -loop-vectorize -mtriple=x86_64-- -mattr=+avx | FileCheck %s --check-prefix=AVX1
	; RUN: opt < %s -S -loop-vectorize -mtriple=x86_64-- -mattr=+avx2 | FileCheck %s --check-prefix=AVX2			; RUN: opt < %s -S -loop-vectorize -mtriple=x86_64-- -mattr=+avx2 | FileCheck %s --check-prefix=AVX2
	; RUN: opt < %s -S -loop-vectorize -mtriple=x86_64-- -mcpu=slm | FileCheck %s --check-prefix=SSE2			; RUN: opt < %s -S -loop-vectorize -mtriple=x86_64-- -mcpu=slm | FileCheck %s --check-prefix=SSE2

	define void @test_muladd(i32* noalias nocapture %d1, i16* noalias nocapture readonly %s1, i16* noalias nocapture readonly %s2, i32 %n) {			define void @test_muladd(i32* noalias nocapture %d1, i16* noalias nocapture readonly %s1, i16* noalias nocapture readonly %s2, i32 %n) {
	; SSE2-LABEL: @test_muladd(			; SSE2-LABEL: @test_muladd(
	; SSE2-NEXT: entry:			; SSE2-NEXT: entry:
	; SSE2-NEXT: [[CMP30:%.*]] = icmp sgt i32 [[N:%.*]], 0			; SSE2-NEXT: [[CMP30:%.*]] = icmp sgt i32 [[N:%.*]], 0
	; SSE2-NEXT: br i1 [[CMP30]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_END:%.*]]			; SSE2-NEXT: br i1 [[CMP30]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_END:%.*]]
	; SSE2: for.body.preheader:			; SSE2: for.body.preheader:
	; SSE2-NEXT: [[WIDE_TRIP_COUNT:%.*]] = zext i32 [[N]] to i64			; SSE2-NEXT: [[WIDE_TRIP_COUNT:%.*]] = zext i32 [[N]] to i64
				; SSE2-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[WIDE_TRIP_COUNT]], 2
				; SSE2-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; SSE2: vector.ph:
				; SSE2-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[WIDE_TRIP_COUNT]], 2
				; SSE2-NEXT: [[N_VEC:%.*]] = sub i64 [[WIDE_TRIP_COUNT]], [[N_MOD_VF]]
				; SSE2-NEXT: br label [[VECTOR_BODY:%.*]]
				; SSE2: vector.body:
				; SSE2-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; SSE2-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
				; SSE2-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 1
				; SSE2-NEXT: [[TMP2:%.*]] = getelementptr inbounds i16, i16* [[S1:%.*]], i64 [[TMP1]]
				; SSE2-NEXT: [[TMP3:%.*]] = getelementptr inbounds i16, i16* [[TMP2]], i32 0
				; SSE2-NEXT: [[TMP4:%.*]] = bitcast i16* [[TMP3]] to <4 x i16>*
				; SSE2-NEXT: [[WIDE_VEC:%.*]] = load <4 x i16>, <4 x i16>* [[TMP4]], align 2
				; SSE2-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <4 x i16> [[WIDE_VEC]], <4 x i16> poison, <2 x i32> <i32 0, i32 2>
				; SSE2-NEXT: [[STRIDED_VEC1:%.*]] = shufflevector <4 x i16> [[WIDE_VEC]], <4 x i16> poison, <2 x i32> <i32 1, i32 3>
				; SSE2-NEXT: [[TMP5:%.*]] = sext <2 x i16> [[STRIDED_VEC]] to <2 x i32>
				; SSE2-NEXT: [[TMP6:%.*]] = getelementptr inbounds i16, i16* [[S2:%.*]], i64 [[TMP1]]
				; SSE2-NEXT: [[TMP7:%.*]] = getelementptr inbounds i16, i16* [[TMP6]], i32 0
				; SSE2-NEXT: [[TMP8:%.*]] = bitcast i16* [[TMP7]] to <4 x i16>*
				; SSE2-NEXT: [[WIDE_VEC2:%.*]] = load <4 x i16>, <4 x i16>* [[TMP8]], align 2
				; SSE2-NEXT: [[STRIDED_VEC3:%.*]] = shufflevector <4 x i16> [[WIDE_VEC2]], <4 x i16> poison, <2 x i32> <i32 0, i32 2>
				; SSE2-NEXT: [[STRIDED_VEC4:%.*]] = shufflevector <4 x i16> [[WIDE_VEC2]], <4 x i16> poison, <2 x i32> <i32 1, i32 3>
				; SSE2-NEXT: [[TMP9:%.*]] = sext <2 x i16> [[STRIDED_VEC3]] to <2 x i32>
				; SSE2-NEXT: [[TMP10:%.*]] = mul nsw <2 x i32> [[TMP9]], [[TMP5]]
				; SSE2-NEXT: [[TMP11:%.*]] = or i64 [[TMP1]], 1
				; SSE2-NEXT: [[TMP12:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP11]]
				; SSE2-NEXT: [[TMP13:%.*]] = sext <2 x i16> [[STRIDED_VEC1]] to <2 x i32>
				; SSE2-NEXT: [[TMP14:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP11]]
				; SSE2-NEXT: [[TMP15:%.*]] = sext <2 x i16> [[STRIDED_VEC4]] to <2 x i32>
				; SSE2-NEXT: [[TMP16:%.*]] = mul nsw <2 x i32> [[TMP15]], [[TMP13]]
				; SSE2-NEXT: [[TMP17:%.*]] = add nsw <2 x i32> [[TMP16]], [[TMP10]]
				; SSE2-NEXT: [[TMP18:%.*]] = getelementptr inbounds i32, i32* [[D1:%.*]], i64 [[TMP0]]
				; SSE2-NEXT: [[TMP19:%.*]] = getelementptr inbounds i32, i32* [[TMP18]], i32 0
				; SSE2-NEXT: [[TMP20:%.*]] = bitcast i32* [[TMP19]] to <2 x i32>*
				; SSE2-NEXT: store <2 x i32> [[TMP17]], <2 x i32>* [[TMP20]], align 4
				; SSE2-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
				; SSE2-NEXT: [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; SSE2-NEXT: br i1 [[TMP21]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
				; SSE2: middle.block:
				; SSE2-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[WIDE_TRIP_COUNT]], [[N_VEC]]
				; SSE2-NEXT: br i1 [[CMP_N]], label [[FOR_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
				; SSE2: scalar.ph:
				; SSE2-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
	; SSE2-NEXT: br label [[FOR_BODY:%.*]]			; SSE2-NEXT: br label [[FOR_BODY:%.*]]
	; SSE2: for.body:			; SSE2: for.body:
	; SSE2-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]			; SSE2-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
	; SSE2-NEXT: [[TMP0:%.*]] = shl nuw nsw i64 [[INDVARS_IV]], 1			; SSE2-NEXT: [[TMP22:%.*]] = shl nuw nsw i64 [[INDVARS_IV]], 1
	; SSE2-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, i16* [[S1:%.*]], i64 [[TMP0]]			; SSE2-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP22]]
	; SSE2-NEXT: [[TMP1:%.*]] = load i16, i16* [[ARRAYIDX]], align 2			; SSE2-NEXT: [[TMP23:%.*]] = load i16, i16* [[ARRAYIDX]], align 2
	; SSE2-NEXT: [[CONV:%.*]] = sext i16 [[TMP1]] to i32			; SSE2-NEXT: [[CONV:%.*]] = sext i16 [[TMP23]] to i32
	; SSE2-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i16, i16* [[S2:%.*]], i64 [[TMP0]]			; SSE2-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP22]]
	; SSE2-NEXT: [[TMP2:%.*]] = load i16, i16* [[ARRAYIDX4]], align 2			; SSE2-NEXT: [[TMP24:%.*]] = load i16, i16* [[ARRAYIDX4]], align 2
	; SSE2-NEXT: [[CONV5:%.*]] = sext i16 [[TMP2]] to i32			; SSE2-NEXT: [[CONV5:%.*]] = sext i16 [[TMP24]] to i32
	; SSE2-NEXT: [[MUL6:%.*]] = mul nsw i32 [[CONV5]], [[CONV]]			; SSE2-NEXT: [[MUL6:%.*]] = mul nsw i32 [[CONV5]], [[CONV]]
	; SSE2-NEXT: [[TMP3:%.*]] = or i64 [[TMP0]], 1			; SSE2-NEXT: [[TMP25:%.*]] = or i64 [[TMP22]], 1
	; SSE2-NEXT: [[ARRAYIDX10:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP3]]			; SSE2-NEXT: [[ARRAYIDX10:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP25]]
	; SSE2-NEXT: [[TMP4:%.*]] = load i16, i16* [[ARRAYIDX10]], align 2			; SSE2-NEXT: [[TMP26:%.*]] = load i16, i16* [[ARRAYIDX10]], align 2
	; SSE2-NEXT: [[CONV11:%.*]] = sext i16 [[TMP4]] to i32			; SSE2-NEXT: [[CONV11:%.*]] = sext i16 [[TMP26]] to i32
	; SSE2-NEXT: [[ARRAYIDX15:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP3]]			; SSE2-NEXT: [[ARRAYIDX15:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP25]]
	; SSE2-NEXT: [[TMP5:%.*]] = load i16, i16* [[ARRAYIDX15]], align 2			; SSE2-NEXT: [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX15]], align 2
	; SSE2-NEXT: [[CONV16:%.*]] = sext i16 [[TMP5]] to i32			; SSE2-NEXT: [[CONV16:%.*]] = sext i16 [[TMP27]] to i32
	; SSE2-NEXT: [[MUL17:%.*]] = mul nsw i32 [[CONV16]], [[CONV11]]			; SSE2-NEXT: [[MUL17:%.*]] = mul nsw i32 [[CONV16]], [[CONV11]]
	; SSE2-NEXT: [[ADD18:%.*]] = add nsw i32 [[MUL17]], [[MUL6]]			; SSE2-NEXT: [[ADD18:%.*]] = add nsw i32 [[MUL17]], [[MUL6]]
	; SSE2-NEXT: [[ARRAYIDX20:%.*]] = getelementptr inbounds i32, i32* [[D1:%.*]], i64 [[INDVARS_IV]]			; SSE2-NEXT: [[ARRAYIDX20:%.*]] = getelementptr inbounds i32, i32* [[D1]], i64 [[INDVARS_IV]]
	; SSE2-NEXT: store i32 [[ADD18]], i32* [[ARRAYIDX20]], align 4			; SSE2-NEXT: store i32 [[ADD18]], i32* [[ARRAYIDX20]], align 4
	; SSE2-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1			; SSE2-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; SSE2-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[WIDE_TRIP_COUNT]]			; SSE2-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[WIDE_TRIP_COUNT]]
	; SSE2-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END_LOOPEXIT:%.*]], label [[FOR_BODY]]			; SSE2-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]
	; SSE2: for.end.loopexit:			; SSE2: for.end.loopexit:
	; SSE2-NEXT: br label [[FOR_END]]			; SSE2-NEXT: br label [[FOR_END]]
	; SSE2: for.end:			; SSE2: for.end:
	; SSE2-NEXT: ret void			; SSE2-NEXT: ret void
	;			;
	; SSE41-LABEL: @test_muladd(			; SSE41-LABEL: @test_muladd(
	; SSE41-NEXT: entry:			; SSE41-NEXT: entry:
	; SSE41-NEXT: [[CMP30:%.*]] = icmp sgt i32 [[N:%.*]], 0			; SSE41-NEXT: [[CMP30:%.*]] = icmp sgt i32 [[N:%.*]], 0
	; SSE41-NEXT: br i1 [[CMP30]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_END:%.*]]			; SSE41-NEXT: br i1 [[CMP30]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_END:%.*]]
	; SSE41: for.body.preheader:			; SSE41: for.body.preheader:
	; SSE41-NEXT: [[WIDE_TRIP_COUNT:%.*]] = zext i32 [[N]] to i64			; SSE41-NEXT: [[WIDE_TRIP_COUNT:%.*]] = zext i32 [[N]] to i64
	; SSE41-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[WIDE_TRIP_COUNT]], 4			; SSE41-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[WIDE_TRIP_COUNT]], 4
	; SSE41-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; SSE41-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; SSE41: vector.ph:			; SSE41: vector.ph:
	; SSE41-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[WIDE_TRIP_COUNT]], 4			; SSE41-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[WIDE_TRIP_COUNT]], 4
	; SSE41-NEXT: [[N_VEC:%.*]] = sub i64 [[WIDE_TRIP_COUNT]], [[N_MOD_VF]]			; SSE41-NEXT: [[N_VEC:%.*]] = sub i64 [[WIDE_TRIP_COUNT]], [[N_MOD_VF]]
	; SSE41-NEXT: br label [[VECTOR_BODY:%.*]]			; SSE41-NEXT: br label [[VECTOR_BODY:%.*]]
	; SSE41: vector.body:			; SSE41: vector.body:
	; SSE41-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; SSE41-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; SSE41-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0			; SSE41-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
	; SSE41-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 1			; SSE41-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 2
	; SSE41-NEXT: [[TMP2:%.*]] = getelementptr inbounds i16, i16* [[S1:%.*]], i64 [[TMP1]]			; SSE41-NEXT: [[TMP2:%.*]] = shl nuw nsw i64 [[TMP0]], 1
	; SSE41-NEXT: [[TMP3:%.*]] = getelementptr inbounds i16, i16* [[TMP2]], i32 0			; SSE41-NEXT: [[TMP3:%.*]] = shl nuw nsw i64 [[TMP1]], 1
	; SSE41-NEXT: [[TMP4:%.*]] = bitcast i16* [[TMP3]] to <8 x i16>*			; SSE41-NEXT: [[TMP4:%.*]] = getelementptr inbounds i16, i16* [[S1:%.*]], i64 [[TMP2]]
	; SSE41-NEXT: [[WIDE_VEC:%.*]] = load <8 x i16>, <8 x i16>* [[TMP4]], align 2			; SSE41-NEXT: [[TMP5:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP3]]
	; SSE41-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <8 x i16> [[WIDE_VEC]], <8 x i16> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>			; SSE41-NEXT: [[TMP6:%.*]] = getelementptr inbounds i16, i16* [[TMP4]], i32 0
	; SSE41-NEXT: [[STRIDED_VEC1:%.*]] = shufflevector <8 x i16> [[WIDE_VEC]], <8 x i16> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>			; SSE41-NEXT: [[TMP7:%.*]] = bitcast i16* [[TMP6]] to <4 x i16>*
	; SSE41-NEXT: [[TMP5:%.*]] = sext <4 x i16> [[STRIDED_VEC]] to <4 x i32>			; SSE41-NEXT: [[TMP8:%.*]] = getelementptr inbounds i16, i16* [[TMP5]], i32 0
	; SSE41-NEXT: [[TMP6:%.*]] = getelementptr inbounds i16, i16* [[S2:%.*]], i64 [[TMP1]]			; SSE41-NEXT: [[TMP9:%.*]] = bitcast i16* [[TMP8]] to <4 x i16>*
	; SSE41-NEXT: [[TMP7:%.*]] = getelementptr inbounds i16, i16* [[TMP6]], i32 0			; SSE41-NEXT: [[WIDE_VEC:%.*]] = load <4 x i16>, <4 x i16>* [[TMP7]], align 2
	; SSE41-NEXT: [[TMP8:%.*]] = bitcast i16* [[TMP7]] to <8 x i16>*			; SSE41-NEXT: [[WIDE_VEC1:%.*]] = load <4 x i16>, <4 x i16>* [[TMP9]], align 2
	; SSE41-NEXT: [[WIDE_VEC2:%.*]] = load <8 x i16>, <8 x i16>* [[TMP8]], align 2			; SSE41-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <4 x i16> [[WIDE_VEC]], <4 x i16> poison, <2 x i32> <i32 0, i32 2>
	; SSE41-NEXT: [[STRIDED_VEC3:%.*]] = shufflevector <8 x i16> [[WIDE_VEC2]], <8 x i16> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>			; SSE41-NEXT: [[STRIDED_VEC2:%.*]] = shufflevector <4 x i16> [[WIDE_VEC1]], <4 x i16> poison, <2 x i32> <i32 0, i32 2>
	; SSE41-NEXT: [[STRIDED_VEC4:%.*]] = shufflevector <8 x i16> [[WIDE_VEC2]], <8 x i16> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>			; SSE41-NEXT: [[STRIDED_VEC3:%.*]] = shufflevector <4 x i16> [[WIDE_VEC]], <4 x i16> poison, <2 x i32> <i32 1, i32 3>
	; SSE41-NEXT: [[TMP9:%.*]] = sext <4 x i16> [[STRIDED_VEC3]] to <4 x i32>			; SSE41-NEXT: [[STRIDED_VEC4:%.*]] = shufflevector <4 x i16> [[WIDE_VEC1]], <4 x i16> poison, <2 x i32> <i32 1, i32 3>
	; SSE41-NEXT: [[TMP10:%.*]] = mul nsw <4 x i32> [[TMP9]], [[TMP5]]			; SSE41-NEXT: [[TMP10:%.*]] = sext <2 x i16> [[STRIDED_VEC]] to <2 x i32>
	; SSE41-NEXT: [[TMP11:%.*]] = or i64 [[TMP1]], 1			; SSE41-NEXT: [[TMP11:%.*]] = sext <2 x i16> [[STRIDED_VEC2]] to <2 x i32>
	; SSE41-NEXT: [[TMP12:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP11]]			; SSE41-NEXT: [[TMP12:%.*]] = getelementptr inbounds i16, i16* [[S2:%.*]], i64 [[TMP2]]
	; SSE41-NEXT: [[TMP13:%.*]] = sext <4 x i16> [[STRIDED_VEC1]] to <4 x i32>			; SSE41-NEXT: [[TMP13:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP3]]
	; SSE41-NEXT: [[TMP14:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP11]]			; SSE41-NEXT: [[TMP14:%.*]] = getelementptr inbounds i16, i16* [[TMP12]], i32 0
	; SSE41-NEXT: [[TMP15:%.*]] = sext <4 x i16> [[STRIDED_VEC4]] to <4 x i32>			; SSE41-NEXT: [[TMP15:%.*]] = bitcast i16* [[TMP14]] to <4 x i16>*
	; SSE41-NEXT: [[TMP16:%.*]] = mul nsw <4 x i32> [[TMP15]], [[TMP13]]			; SSE41-NEXT: [[TMP16:%.*]] = getelementptr inbounds i16, i16* [[TMP13]], i32 0
	; SSE41-NEXT: [[TMP17:%.*]] = add nsw <4 x i32> [[TMP16]], [[TMP10]]			; SSE41-NEXT: [[TMP17:%.*]] = bitcast i16* [[TMP16]] to <4 x i16>*
	; SSE41-NEXT: [[TMP18:%.*]] = getelementptr inbounds i32, i32* [[D1:%.*]], i64 [[TMP0]]			; SSE41-NEXT: [[WIDE_VEC5:%.*]] = load <4 x i16>, <4 x i16>* [[TMP15]], align 2
	; SSE41-NEXT: [[TMP19:%.*]] = getelementptr inbounds i32, i32* [[TMP18]], i32 0			; SSE41-NEXT: [[WIDE_VEC6:%.*]] = load <4 x i16>, <4 x i16>* [[TMP17]], align 2
	; SSE41-NEXT: [[TMP20:%.*]] = bitcast i32* [[TMP19]] to <4 x i32>*			; SSE41-NEXT: [[STRIDED_VEC7:%.*]] = shufflevector <4 x i16> [[WIDE_VEC5]], <4 x i16> poison, <2 x i32> <i32 0, i32 2>
	; SSE41-NEXT: store <4 x i32> [[TMP17]], <4 x i32>* [[TMP20]], align 4			; SSE41-NEXT: [[STRIDED_VEC8:%.*]] = shufflevector <4 x i16> [[WIDE_VEC6]], <4 x i16> poison, <2 x i32> <i32 0, i32 2>
				; SSE41-NEXT: [[STRIDED_VEC9:%.*]] = shufflevector <4 x i16> [[WIDE_VEC5]], <4 x i16> poison, <2 x i32> <i32 1, i32 3>
				; SSE41-NEXT: [[STRIDED_VEC10:%.*]] = shufflevector <4 x i16> [[WIDE_VEC6]], <4 x i16> poison, <2 x i32> <i32 1, i32 3>
				; SSE41-NEXT: [[TMP18:%.*]] = sext <2 x i16> [[STRIDED_VEC7]] to <2 x i32>
				; SSE41-NEXT: [[TMP19:%.*]] = sext <2 x i16> [[STRIDED_VEC8]] to <2 x i32>
				; SSE41-NEXT: [[TMP20:%.*]] = mul nsw <2 x i32> [[TMP18]], [[TMP10]]
				; SSE41-NEXT: [[TMP21:%.*]] = mul nsw <2 x i32> [[TMP19]], [[TMP11]]
				; SSE41-NEXT: [[TMP22:%.*]] = or i64 [[TMP2]], 1
				; SSE41-NEXT: [[TMP23:%.*]] = or i64 [[TMP3]], 1
				; SSE41-NEXT: [[TMP24:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP22]]
				; SSE41-NEXT: [[TMP25:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP23]]
				; SSE41-NEXT: [[TMP26:%.*]] = sext <2 x i16> [[STRIDED_VEC3]] to <2 x i32>
				; SSE41-NEXT: [[TMP27:%.*]] = sext <2 x i16> [[STRIDED_VEC4]] to <2 x i32>
				; SSE41-NEXT: [[TMP28:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP22]]
				; SSE41-NEXT: [[TMP29:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP23]]
				; SSE41-NEXT: [[TMP30:%.*]] = sext <2 x i16> [[STRIDED_VEC9]] to <2 x i32>
				; SSE41-NEXT: [[TMP31:%.*]] = sext <2 x i16> [[STRIDED_VEC10]] to <2 x i32>
				; SSE41-NEXT: [[TMP32:%.*]] = mul nsw <2 x i32> [[TMP30]], [[TMP26]]
				; SSE41-NEXT: [[TMP33:%.*]] = mul nsw <2 x i32> [[TMP31]], [[TMP27]]
				; SSE41-NEXT: [[TMP34:%.*]] = add nsw <2 x i32> [[TMP32]], [[TMP20]]
				; SSE41-NEXT: [[TMP35:%.*]] = add nsw <2 x i32> [[TMP33]], [[TMP21]]
				; SSE41-NEXT: [[TMP36:%.*]] = getelementptr inbounds i32, i32* [[D1:%.*]], i64 [[TMP0]]
				; SSE41-NEXT: [[TMP37:%.*]] = getelementptr inbounds i32, i32* [[D1]], i64 [[TMP1]]
				; SSE41-NEXT: [[TMP38:%.*]] = getelementptr inbounds i32, i32* [[TMP36]], i32 0
				; SSE41-NEXT: [[TMP39:%.*]] = bitcast i32* [[TMP38]] to <2 x i32>*
				; SSE41-NEXT: store <2 x i32> [[TMP34]], <2 x i32>* [[TMP39]], align 4
				; SSE41-NEXT: [[TMP40:%.*]] = getelementptr inbounds i32, i32* [[TMP36]], i32 2
				; SSE41-NEXT: [[TMP41:%.*]] = bitcast i32* [[TMP40]] to <2 x i32>*
				; SSE41-NEXT: store <2 x i32> [[TMP35]], <2 x i32>* [[TMP41]], align 4
	; SSE41-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4			; SSE41-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
	; SSE41-NEXT: [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]			; SSE41-NEXT: [[TMP42:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
	; SSE41-NEXT: br i1 [[TMP21]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]			; SSE41-NEXT: br i1 [[TMP42]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
	; SSE41: middle.block:			; SSE41: middle.block:
	; SSE41-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[WIDE_TRIP_COUNT]], [[N_VEC]]			; SSE41-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[WIDE_TRIP_COUNT]], [[N_VEC]]
	; SSE41-NEXT: br i1 [[CMP_N]], label [[FOR_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; SSE41-NEXT: br i1 [[CMP_N]], label [[FOR_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	; SSE41: scalar.ph:			; SSE41: scalar.ph:
	; SSE41-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]			; SSE41-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
	; SSE41-NEXT: br label [[FOR_BODY:%.*]]			; SSE41-NEXT: br label [[FOR_BODY:%.*]]
	; SSE41: for.body:			; SSE41: for.body:
	; SSE41-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]			; SSE41-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
	; SSE41-NEXT: [[TMP22:%.*]] = shl nuw nsw i64 [[INDVARS_IV]], 1			; SSE41-NEXT: [[TMP43:%.*]] = shl nuw nsw i64 [[INDVARS_IV]], 1
	; SSE41-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP22]]			; SSE41-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP43]]
	; SSE41-NEXT: [[TMP23:%.*]] = load i16, i16* [[ARRAYIDX]], align 2			; SSE41-NEXT: [[TMP44:%.*]] = load i16, i16* [[ARRAYIDX]], align 2
	; SSE41-NEXT: [[CONV:%.*]] = sext i16 [[TMP23]] to i32			; SSE41-NEXT: [[CONV:%.*]] = sext i16 [[TMP44]] to i32
	; SSE41-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP22]]			; SSE41-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP43]]
	; SSE41-NEXT: [[TMP24:%.*]] = load i16, i16* [[ARRAYIDX4]], align 2			; SSE41-NEXT: [[TMP45:%.*]] = load i16, i16* [[ARRAYIDX4]], align 2
	; SSE41-NEXT: [[CONV5:%.*]] = sext i16 [[TMP24]] to i32			; SSE41-NEXT: [[CONV5:%.*]] = sext i16 [[TMP45]] to i32
	; SSE41-NEXT: [[MUL6:%.*]] = mul nsw i32 [[CONV5]], [[CONV]]			; SSE41-NEXT: [[MUL6:%.*]] = mul nsw i32 [[CONV5]], [[CONV]]
	; SSE41-NEXT: [[TMP25:%.*]] = or i64 [[TMP22]], 1			; SSE41-NEXT: [[TMP46:%.*]] = or i64 [[TMP43]], 1
	; SSE41-NEXT: [[ARRAYIDX10:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP25]]			; SSE41-NEXT: [[ARRAYIDX10:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP46]]
	; SSE41-NEXT: [[TMP26:%.*]] = load i16, i16* [[ARRAYIDX10]], align 2			; SSE41-NEXT: [[TMP47:%.*]] = load i16, i16* [[ARRAYIDX10]], align 2
	; SSE41-NEXT: [[CONV11:%.*]] = sext i16 [[TMP26]] to i32			; SSE41-NEXT: [[CONV11:%.*]] = sext i16 [[TMP47]] to i32
	; SSE41-NEXT: [[ARRAYIDX15:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP25]]			; SSE41-NEXT: [[ARRAYIDX15:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP46]]
	; SSE41-NEXT: [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX15]], align 2			; SSE41-NEXT: [[TMP48:%.*]] = load i16, i16* [[ARRAYIDX15]], align 2
	; SSE41-NEXT: [[CONV16:%.*]] = sext i16 [[TMP27]] to i32			; SSE41-NEXT: [[CONV16:%.*]] = sext i16 [[TMP48]] to i32
	; SSE41-NEXT: [[MUL17:%.*]] = mul nsw i32 [[CONV16]], [[CONV11]]			; SSE41-NEXT: [[MUL17:%.*]] = mul nsw i32 [[CONV16]], [[CONV11]]
	; SSE41-NEXT: [[ADD18:%.*]] = add nsw i32 [[MUL17]], [[MUL6]]			; SSE41-NEXT: [[ADD18:%.*]] = add nsw i32 [[MUL17]], [[MUL6]]
	; SSE41-NEXT: [[ARRAYIDX20:%.*]] = getelementptr inbounds i32, i32* [[D1]], i64 [[INDVARS_IV]]			; SSE41-NEXT: [[ARRAYIDX20:%.*]] = getelementptr inbounds i32, i32* [[D1]], i64 [[INDVARS_IV]]
	; SSE41-NEXT: store i32 [[ADD18]], i32* [[ARRAYIDX20]], align 4			; SSE41-NEXT: store i32 [[ADD18]], i32* [[ARRAYIDX20]], align 4
	; SSE41-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1			; SSE41-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; SSE41-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[WIDE_TRIP_COUNT]]			; SSE41-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[WIDE_TRIP_COUNT]]
	; SSE41-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]			; SSE41-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]
	; SSE41: for.end.loopexit:			; SSE41: for.end.loopexit:
	; SSE41-NEXT: br label [[FOR_END]]			; SSE41-NEXT: br label [[FOR_END]]
	; SSE41: for.end:			; SSE41: for.end:
	; SSE41-NEXT: ret void			; SSE41-NEXT: ret void
	;			;
	; AVX1-LABEL: @test_muladd(			; AVX1-LABEL: @test_muladd(
	; AVX1-NEXT: entry:			; AVX1-NEXT: entry:
	; AVX1-NEXT: [[CMP30:%.*]] = icmp sgt i32 [[N:%.*]], 0			; AVX1-NEXT: [[CMP30:%.*]] = icmp sgt i32 [[N:%.*]], 0
	; AVX1-NEXT: br i1 [[CMP30]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_END:%.*]]			; AVX1-NEXT: br i1 [[CMP30]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_END:%.*]]
	; AVX1: for.body.preheader:			; AVX1: for.body.preheader:
	; AVX1-NEXT: [[WIDE_TRIP_COUNT:%.*]] = zext i32 [[N]] to i64			; AVX1-NEXT: [[WIDE_TRIP_COUNT:%.*]] = zext i32 [[N]] to i64
	; AVX1-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[WIDE_TRIP_COUNT]], 4			; AVX1-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[WIDE_TRIP_COUNT]], 8
	; AVX1-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; AVX1-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; AVX1: vector.ph:			; AVX1: vector.ph:
	; AVX1-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[WIDE_TRIP_COUNT]], 4			; AVX1-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[WIDE_TRIP_COUNT]], 8
	; AVX1-NEXT: [[N_VEC:%.*]] = sub i64 [[WIDE_TRIP_COUNT]], [[N_MOD_VF]]			; AVX1-NEXT: [[N_VEC:%.*]] = sub i64 [[WIDE_TRIP_COUNT]], [[N_MOD_VF]]
	; AVX1-NEXT: br label [[VECTOR_BODY:%.*]]			; AVX1-NEXT: br label [[VECTOR_BODY:%.*]]
	; AVX1: vector.body:			; AVX1: vector.body:
	; AVX1-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; AVX1-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; AVX1-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0			; AVX1-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
	; AVX1-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 1			; AVX1-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 2
	; AVX1-NEXT: [[TMP2:%.*]] = getelementptr inbounds i16, i16* [[S1:%.*]], i64 [[TMP1]]			; AVX1-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 4
	; AVX1-NEXT: [[TMP3:%.*]] = getelementptr inbounds i16, i16* [[TMP2]], i32 0			; AVX1-NEXT: [[TMP3:%.*]] = add i64 [[INDEX]], 6
	; AVX1-NEXT: [[TMP4:%.*]] = bitcast i16* [[TMP3]] to <8 x i16>*			; AVX1-NEXT: [[TMP4:%.*]] = shl nuw nsw i64 [[TMP0]], 1
	; AVX1-NEXT: [[WIDE_VEC:%.*]] = load <8 x i16>, <8 x i16>* [[TMP4]], align 2			; AVX1-NEXT: [[TMP5:%.*]] = shl nuw nsw i64 [[TMP1]], 1
	; AVX1-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <8 x i16> [[WIDE_VEC]], <8 x i16> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>			; AVX1-NEXT: [[TMP6:%.*]] = shl nuw nsw i64 [[TMP2]], 1
	; AVX1-NEXT: [[STRIDED_VEC1:%.*]] = shufflevector <8 x i16> [[WIDE_VEC]], <8 x i16> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>			; AVX1-NEXT: [[TMP7:%.*]] = shl nuw nsw i64 [[TMP3]], 1
	; AVX1-NEXT: [[TMP5:%.*]] = sext <4 x i16> [[STRIDED_VEC]] to <4 x i32>			; AVX1-NEXT: [[TMP8:%.*]] = getelementptr inbounds i16, i16* [[S1:%.*]], i64 [[TMP4]]
	; AVX1-NEXT: [[TMP6:%.*]] = getelementptr inbounds i16, i16* [[S2:%.*]], i64 [[TMP1]]			; AVX1-NEXT: [[TMP9:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP5]]
	; AVX1-NEXT: [[TMP7:%.*]] = getelementptr inbounds i16, i16* [[TMP6]], i32 0			; AVX1-NEXT: [[TMP10:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP6]]
	; AVX1-NEXT: [[TMP8:%.*]] = bitcast i16* [[TMP7]] to <8 x i16>*			; AVX1-NEXT: [[TMP11:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP7]]
	; AVX1-NEXT: [[WIDE_VEC2:%.*]] = load <8 x i16>, <8 x i16>* [[TMP8]], align 2			; AVX1-NEXT: [[TMP12:%.*]] = getelementptr inbounds i16, i16* [[TMP8]], i32 0
	; AVX1-NEXT: [[STRIDED_VEC3:%.*]] = shufflevector <8 x i16> [[WIDE_VEC2]], <8 x i16> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>			; AVX1-NEXT: [[TMP13:%.*]] = bitcast i16* [[TMP12]] to <4 x i16>*
	; AVX1-NEXT: [[STRIDED_VEC4:%.*]] = shufflevector <8 x i16> [[WIDE_VEC2]], <8 x i16> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>			; AVX1-NEXT: [[TMP14:%.*]] = getelementptr inbounds i16, i16* [[TMP9]], i32 0
	; AVX1-NEXT: [[TMP9:%.*]] = sext <4 x i16> [[STRIDED_VEC3]] to <4 x i32>			; AVX1-NEXT: [[TMP15:%.*]] = bitcast i16* [[TMP14]] to <4 x i16>*
	; AVX1-NEXT: [[TMP10:%.*]] = mul nsw <4 x i32> [[TMP9]], [[TMP5]]			; AVX1-NEXT: [[TMP16:%.*]] = getelementptr inbounds i16, i16* [[TMP10]], i32 0
	; AVX1-NEXT: [[TMP11:%.*]] = or i64 [[TMP1]], 1			; AVX1-NEXT: [[TMP17:%.*]] = bitcast i16* [[TMP16]] to <4 x i16>*
	; AVX1-NEXT: [[TMP12:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP11]]			; AVX1-NEXT: [[TMP18:%.*]] = getelementptr inbounds i16, i16* [[TMP11]], i32 0
	; AVX1-NEXT: [[TMP13:%.*]] = sext <4 x i16> [[STRIDED_VEC1]] to <4 x i32>			; AVX1-NEXT: [[TMP19:%.*]] = bitcast i16* [[TMP18]] to <4 x i16>*
	; AVX1-NEXT: [[TMP14:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP11]]			; AVX1-NEXT: [[WIDE_VEC:%.*]] = load <4 x i16>, <4 x i16>* [[TMP13]], align 2
	; AVX1-NEXT: [[TMP15:%.*]] = sext <4 x i16> [[STRIDED_VEC4]] to <4 x i32>			; AVX1-NEXT: [[WIDE_VEC1:%.*]] = load <4 x i16>, <4 x i16>* [[TMP15]], align 2
	; AVX1-NEXT: [[TMP16:%.*]] = mul nsw <4 x i32> [[TMP15]], [[TMP13]]			; AVX1-NEXT: [[WIDE_VEC2:%.*]] = load <4 x i16>, <4 x i16>* [[TMP17]], align 2
	; AVX1-NEXT: [[TMP17:%.*]] = add nsw <4 x i32> [[TMP16]], [[TMP10]]			; AVX1-NEXT: [[WIDE_VEC3:%.*]] = load <4 x i16>, <4 x i16>* [[TMP19]], align 2
	; AVX1-NEXT: [[TMP18:%.*]] = getelementptr inbounds i32, i32* [[D1:%.*]], i64 [[TMP0]]			; AVX1-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <4 x i16> [[WIDE_VEC]], <4 x i16> poison, <2 x i32> <i32 0, i32 2>
	; AVX1-NEXT: [[TMP19:%.*]] = getelementptr inbounds i32, i32* [[TMP18]], i32 0			; AVX1-NEXT: [[STRIDED_VEC4:%.*]] = shufflevector <4 x i16> [[WIDE_VEC1]], <4 x i16> poison, <2 x i32> <i32 0, i32 2>
	; AVX1-NEXT: [[TMP20:%.*]] = bitcast i32* [[TMP19]] to <4 x i32>*			; AVX1-NEXT: [[STRIDED_VEC5:%.*]] = shufflevector <4 x i16> [[WIDE_VEC2]], <4 x i16> poison, <2 x i32> <i32 0, i32 2>
	; AVX1-NEXT: store <4 x i32> [[TMP17]], <4 x i32>* [[TMP20]], align 4			; AVX1-NEXT: [[STRIDED_VEC6:%.*]] = shufflevector <4 x i16> [[WIDE_VEC3]], <4 x i16> poison, <2 x i32> <i32 0, i32 2>
	; AVX1-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4			; AVX1-NEXT: [[STRIDED_VEC7:%.*]] = shufflevector <4 x i16> [[WIDE_VEC]], <4 x i16> poison, <2 x i32> <i32 1, i32 3>
	; AVX1-NEXT: [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]			; AVX1-NEXT: [[STRIDED_VEC8:%.*]] = shufflevector <4 x i16> [[WIDE_VEC1]], <4 x i16> poison, <2 x i32> <i32 1, i32 3>
	; AVX1-NEXT: br i1 [[TMP21]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]			; AVX1-NEXT: [[STRIDED_VEC9:%.*]] = shufflevector <4 x i16> [[WIDE_VEC2]], <4 x i16> poison, <2 x i32> <i32 1, i32 3>
				; AVX1-NEXT: [[STRIDED_VEC10:%.*]] = shufflevector <4 x i16> [[WIDE_VEC3]], <4 x i16> poison, <2 x i32> <i32 1, i32 3>
				; AVX1-NEXT: [[TMP20:%.*]] = sext <2 x i16> [[STRIDED_VEC]] to <2 x i32>
				; AVX1-NEXT: [[TMP21:%.*]] = sext <2 x i16> [[STRIDED_VEC4]] to <2 x i32>
				; AVX1-NEXT: [[TMP22:%.*]] = sext <2 x i16> [[STRIDED_VEC5]] to <2 x i32>
				; AVX1-NEXT: [[TMP23:%.*]] = sext <2 x i16> [[STRIDED_VEC6]] to <2 x i32>
				; AVX1-NEXT: [[TMP24:%.*]] = getelementptr inbounds i16, i16* [[S2:%.*]], i64 [[TMP4]]
				; AVX1-NEXT: [[TMP25:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP5]]
				; AVX1-NEXT: [[TMP26:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP6]]
				; AVX1-NEXT: [[TMP27:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP7]]
				; AVX1-NEXT: [[TMP28:%.*]] = getelementptr inbounds i16, i16* [[TMP24]], i32 0
				; AVX1-NEXT: [[TMP29:%.*]] = bitcast i16* [[TMP28]] to <4 x i16>*
				; AVX1-NEXT: [[TMP30:%.*]] = getelementptr inbounds i16, i16* [[TMP25]], i32 0
				; AVX1-NEXT: [[TMP31:%.*]] = bitcast i16* [[TMP30]] to <4 x i16>*
				; AVX1-NEXT: [[TMP32:%.*]] = getelementptr inbounds i16, i16* [[TMP26]], i32 0
				; AVX1-NEXT: [[TMP33:%.*]] = bitcast i16* [[TMP32]] to <4 x i16>*
				; AVX1-NEXT: [[TMP34:%.*]] = getelementptr inbounds i16, i16* [[TMP27]], i32 0
				; AVX1-NEXT: [[TMP35:%.*]] = bitcast i16* [[TMP34]] to <4 x i16>*
				; AVX1-NEXT: [[WIDE_VEC11:%.*]] = load <4 x i16>, <4 x i16>* [[TMP29]], align 2
				; AVX1-NEXT: [[WIDE_VEC12:%.*]] = load <4 x i16>, <4 x i16>* [[TMP31]], align 2
				; AVX1-NEXT: [[WIDE_VEC13:%.*]] = load <4 x i16>, <4 x i16>* [[TMP33]], align 2
				; AVX1-NEXT: [[WIDE_VEC14:%.*]] = load <4 x i16>, <4 x i16>* [[TMP35]], align 2
				; AVX1-NEXT: [[STRIDED_VEC15:%.*]] = shufflevector <4 x i16> [[WIDE_VEC11]], <4 x i16> poison, <2 x i32> <i32 0, i32 2>
				; AVX1-NEXT: [[STRIDED_VEC16:%.*]] = shufflevector <4 x i16> [[WIDE_VEC12]], <4 x i16> poison, <2 x i32> <i32 0, i32 2>
				; AVX1-NEXT: [[STRIDED_VEC17:%.*]] = shufflevector <4 x i16> [[WIDE_VEC13]], <4 x i16> poison, <2 x i32> <i32 0, i32 2>
				; AVX1-NEXT: [[STRIDED_VEC18:%.*]] = shufflevector <4 x i16> [[WIDE_VEC14]], <4 x i16> poison, <2 x i32> <i32 0, i32 2>
				; AVX1-NEXT: [[STRIDED_VEC19:%.*]] = shufflevector <4 x i16> [[WIDE_VEC11]], <4 x i16> poison, <2 x i32> <i32 1, i32 3>
				; AVX1-NEXT: [[STRIDED_VEC20:%.*]] = shufflevector <4 x i16> [[WIDE_VEC12]], <4 x i16> poison, <2 x i32> <i32 1, i32 3>
				; AVX1-NEXT: [[STRIDED_VEC21:%.*]] = shufflevector <4 x i16> [[WIDE_VEC13]], <4 x i16> poison, <2 x i32> <i32 1, i32 3>
				; AVX1-NEXT: [[STRIDED_VEC22:%.*]] = shufflevector <4 x i16> [[WIDE_VEC14]], <4 x i16> poison, <2 x i32> <i32 1, i32 3>
				; AVX1-NEXT: [[TMP36:%.*]] = sext <2 x i16> [[STRIDED_VEC15]] to <2 x i32>
				; AVX1-NEXT: [[TMP37:%.*]] = sext <2 x i16> [[STRIDED_VEC16]] to <2 x i32>
				; AVX1-NEXT: [[TMP38:%.*]] = sext <2 x i16> [[STRIDED_VEC17]] to <2 x i32>
				; AVX1-NEXT: [[TMP39:%.*]] = sext <2 x i16> [[STRIDED_VEC18]] to <2 x i32>
				; AVX1-NEXT: [[TMP40:%.*]] = mul nsw <2 x i32> [[TMP36]], [[TMP20]]
				; AVX1-NEXT: [[TMP41:%.*]] = mul nsw <2 x i32> [[TMP37]], [[TMP21]]
				; AVX1-NEXT: [[TMP42:%.*]] = mul nsw <2 x i32> [[TMP38]], [[TMP22]]
				; AVX1-NEXT: [[TMP43:%.*]] = mul nsw <2 x i32> [[TMP39]], [[TMP23]]
				; AVX1-NEXT: [[TMP44:%.*]] = or i64 [[TMP4]], 1
				; AVX1-NEXT: [[TMP45:%.*]] = or i64 [[TMP5]], 1
				; AVX1-NEXT: [[TMP46:%.*]] = or i64 [[TMP6]], 1
				; AVX1-NEXT: [[TMP47:%.*]] = or i64 [[TMP7]], 1
				; AVX1-NEXT: [[TMP48:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP44]]
				; AVX1-NEXT: [[TMP49:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP45]]
				; AVX1-NEXT: [[TMP50:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP46]]
				; AVX1-NEXT: [[TMP51:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP47]]
				; AVX1-NEXT: [[TMP52:%.*]] = sext <2 x i16> [[STRIDED_VEC7]] to <2 x i32>
				; AVX1-NEXT: [[TMP53:%.*]] = sext <2 x i16> [[STRIDED_VEC8]] to <2 x i32>
				; AVX1-NEXT: [[TMP54:%.*]] = sext <2 x i16> [[STRIDED_VEC9]] to <2 x i32>
				; AVX1-NEXT: [[TMP55:%.*]] = sext <2 x i16> [[STRIDED_VEC10]] to <2 x i32>
				; AVX1-NEXT: [[TMP56:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP44]]
				; AVX1-NEXT: [[TMP57:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP45]]
				; AVX1-NEXT: [[TMP58:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP46]]
				; AVX1-NEXT: [[TMP59:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP47]]
				; AVX1-NEXT: [[TMP60:%.*]] = sext <2 x i16> [[STRIDED_VEC19]] to <2 x i32>
				; AVX1-NEXT: [[TMP61:%.*]] = sext <2 x i16> [[STRIDED_VEC20]] to <2 x i32>
				; AVX1-NEXT: [[TMP62:%.*]] = sext <2 x i16> [[STRIDED_VEC21]] to <2 x i32>
				; AVX1-NEXT: [[TMP63:%.*]] = sext <2 x i16> [[STRIDED_VEC22]] to <2 x i32>
				; AVX1-NEXT: [[TMP64:%.*]] = mul nsw <2 x i32> [[TMP60]], [[TMP52]]
				; AVX1-NEXT: [[TMP65:%.*]] = mul nsw <2 x i32> [[TMP61]], [[TMP53]]
				; AVX1-NEXT: [[TMP66:%.*]] = mul nsw <2 x i32> [[TMP62]], [[TMP54]]
				; AVX1-NEXT: [[TMP67:%.*]] = mul nsw <2 x i32> [[TMP63]], [[TMP55]]
				; AVX1-NEXT: [[TMP68:%.*]] = add nsw <2 x i32> [[TMP64]], [[TMP40]]
				; AVX1-NEXT: [[TMP69:%.*]] = add nsw <2 x i32> [[TMP65]], [[TMP41]]
				; AVX1-NEXT: [[TMP70:%.*]] = add nsw <2 x i32> [[TMP66]], [[TMP42]]
				; AVX1-NEXT: [[TMP71:%.*]] = add nsw <2 x i32> [[TMP67]], [[TMP43]]
				; AVX1-NEXT: [[TMP72:%.*]] = getelementptr inbounds i32, i32* [[D1:%.*]], i64 [[TMP0]]
				; AVX1-NEXT: [[TMP73:%.*]] = getelementptr inbounds i32, i32* [[D1]], i64 [[TMP1]]
				; AVX1-NEXT: [[TMP74:%.*]] = getelementptr inbounds i32, i32* [[D1]], i64 [[TMP2]]
				; AVX1-NEXT: [[TMP75:%.*]] = getelementptr inbounds i32, i32* [[D1]], i64 [[TMP3]]
				; AVX1-NEXT: [[TMP76:%.*]] = getelementptr inbounds i32, i32* [[TMP72]], i32 0
				; AVX1-NEXT: [[TMP77:%.*]] = bitcast i32* [[TMP76]] to <2 x i32>*
				; AVX1-NEXT: store <2 x i32> [[TMP68]], <2 x i32>* [[TMP77]], align 4
				; AVX1-NEXT: [[TMP78:%.*]] = getelementptr inbounds i32, i32* [[TMP72]], i32 2
				; AVX1-NEXT: [[TMP79:%.*]] = bitcast i32* [[TMP78]] to <2 x i32>*
				; AVX1-NEXT: store <2 x i32> [[TMP69]], <2 x i32>* [[TMP79]], align 4
				; AVX1-NEXT: [[TMP80:%.*]] = getelementptr inbounds i32, i32* [[TMP72]], i32 4
				; AVX1-NEXT: [[TMP81:%.*]] = bitcast i32* [[TMP80]] to <2 x i32>*
				; AVX1-NEXT: store <2 x i32> [[TMP70]], <2 x i32>* [[TMP81]], align 4
				; AVX1-NEXT: [[TMP82:%.*]] = getelementptr inbounds i32, i32* [[TMP72]], i32 6
				; AVX1-NEXT: [[TMP83:%.*]] = bitcast i32* [[TMP82]] to <2 x i32>*
				; AVX1-NEXT: store <2 x i32> [[TMP71]], <2 x i32>* [[TMP83]], align 4
				; AVX1-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
				; AVX1-NEXT: [[TMP84:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; AVX1-NEXT: br i1 [[TMP84]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
	; AVX1: middle.block:			; AVX1: middle.block:
	; AVX1-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[WIDE_TRIP_COUNT]], [[N_VEC]]			; AVX1-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[WIDE_TRIP_COUNT]], [[N_VEC]]
	; AVX1-NEXT: br i1 [[CMP_N]], label [[FOR_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; AVX1-NEXT: br i1 [[CMP_N]], label [[FOR_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	; AVX1: scalar.ph:			; AVX1: scalar.ph:
	; AVX1-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]			; AVX1-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
	; AVX1-NEXT: br label [[FOR_BODY:%.*]]			; AVX1-NEXT: br label [[FOR_BODY:%.*]]
	; AVX1: for.body:			; AVX1: for.body:
	; AVX1-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]			; AVX1-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
	; AVX1-NEXT: [[TMP22:%.*]] = shl nuw nsw i64 [[INDVARS_IV]], 1			; AVX1-NEXT: [[TMP85:%.*]] = shl nuw nsw i64 [[INDVARS_IV]], 1
	; AVX1-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP22]]			; AVX1-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP85]]
	; AVX1-NEXT: [[TMP23:%.*]] = load i16, i16* [[ARRAYIDX]], align 2			; AVX1-NEXT: [[TMP86:%.*]] = load i16, i16* [[ARRAYIDX]], align 2
	; AVX1-NEXT: [[CONV:%.*]] = sext i16 [[TMP23]] to i32			; AVX1-NEXT: [[CONV:%.*]] = sext i16 [[TMP86]] to i32
	; AVX1-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP22]]			; AVX1-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP85]]
	; AVX1-NEXT: [[TMP24:%.*]] = load i16, i16* [[ARRAYIDX4]], align 2			; AVX1-NEXT: [[TMP87:%.*]] = load i16, i16* [[ARRAYIDX4]], align 2
	; AVX1-NEXT: [[CONV5:%.*]] = sext i16 [[TMP24]] to i32			; AVX1-NEXT: [[CONV5:%.*]] = sext i16 [[TMP87]] to i32
	; AVX1-NEXT: [[MUL6:%.*]] = mul nsw i32 [[CONV5]], [[CONV]]			; AVX1-NEXT: [[MUL6:%.*]] = mul nsw i32 [[CONV5]], [[CONV]]
	; AVX1-NEXT: [[TMP25:%.*]] = or i64 [[TMP22]], 1			; AVX1-NEXT: [[TMP88:%.*]] = or i64 [[TMP85]], 1
	; AVX1-NEXT: [[ARRAYIDX10:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP25]]			; AVX1-NEXT: [[ARRAYIDX10:%.*]] = getelementptr inbounds i16, i16* [[S1]], i64 [[TMP88]]
	; AVX1-NEXT: [[TMP26:%.*]] = load i16, i16* [[ARRAYIDX10]], align 2			; AVX1-NEXT: [[TMP89:%.*]] = load i16, i16* [[ARRAYIDX10]], align 2
	; AVX1-NEXT: [[CONV11:%.*]] = sext i16 [[TMP26]] to i32			; AVX1-NEXT: [[CONV11:%.*]] = sext i16 [[TMP89]] to i32
	; AVX1-NEXT: [[ARRAYIDX15:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP25]]			; AVX1-NEXT: [[ARRAYIDX15:%.*]] = getelementptr inbounds i16, i16* [[S2]], i64 [[TMP88]]
	; AVX1-NEXT: [[TMP27:%.*]] = load i16, i16* [[ARRAYIDX15]], align 2			; AVX1-NEXT: [[TMP90:%.*]] = load i16, i16* [[ARRAYIDX15]], align 2
	; AVX1-NEXT: [[CONV16:%.*]] = sext i16 [[TMP27]] to i32			; AVX1-NEXT: [[CONV16:%.*]] = sext i16 [[TMP90]] to i32
	; AVX1-NEXT: [[MUL17:%.*]] = mul nsw i32 [[CONV16]], [[CONV11]]			; AVX1-NEXT: [[MUL17:%.*]] = mul nsw i32 [[CONV16]], [[CONV11]]
	; AVX1-NEXT: [[ADD18:%.*]] = add nsw i32 [[MUL17]], [[MUL6]]			; AVX1-NEXT: [[ADD18:%.*]] = add nsw i32 [[MUL17]], [[MUL6]]
	; AVX1-NEXT: [[ARRAYIDX20:%.*]] = getelementptr inbounds i32, i32* [[D1]], i64 [[INDVARS_IV]]			; AVX1-NEXT: [[ARRAYIDX20:%.*]] = getelementptr inbounds i32, i32* [[D1]], i64 [[INDVARS_IV]]
	; AVX1-NEXT: store i32 [[ADD18]], i32* [[ARRAYIDX20]], align 4			; AVX1-NEXT: store i32 [[ADD18]], i32* [[ARRAYIDX20]], align 4
	; AVX1-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1			; AVX1-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; AVX1-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[WIDE_TRIP_COUNT]]			; AVX1-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[WIDE_TRIP_COUNT]]
	; AVX1-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]			; AVX1-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]
	; AVX1: for.end.loopexit:			; AVX1: for.end.loopexit:

Diff 380187