This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Reduce FADD/FSUB/FMUL costs on later targets (PR36280)
ClosedPublic

Authored by RKSimon on Feb 24 2018, 10:03 AM.

Details

Reviewers

ABataev
craig.topper
spatel

Commits

rG9929f9074033: [X86][SSE] Reduce FADD/FSUB/FMUL costs on later targets (PR36280)
rL326133: [X86][SSE] Reduce FADD/FSUB/FMUL costs on later targets (PR36280)

Summary

Agner's tables indicate that for SSE42+ targets (Nehalem and later) we can reduce the FADD/FSUB/FMUL costs to 1, which should fix the Himeno benchmark.

Note: the AVX512 FDIV costs look rather dodgy, but this isn't part of this patch.
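
As a rough illustration of how the new table entries interact with type legalization, here is a minimal, self-contained sketch of the tiered lookup. It is not LLVM's implementation: the `Ty`, `Level`, `legalize` and `fpArithCost` names are hypothetical stand-ins, and the generic fallback cost of 2 per legalized FP op is an assumption inferred from the unchanged SSE2 expectations in arith-fp.ll.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Simplified stand-ins for LLVM's MVT types and subtarget feature levels.
// These enums and helpers are hypothetical and exist only for this sketch.
enum class Ty { F32, V4F32, V8F32, V16F32, F64, V2F64, V4F64, V8F64 };
enum class Level { SSE2, SSE42, AVX, AVX2, AVX512 };

struct CostEntry { Ty Type; int Cost; };

// Entries mirror the FADD/FSUB/FMUL rows added by the patch. All three
// opcodes share the same costs, so the opcode is omitted here.
static const CostEntry AVX512Table[] = {{Ty::V8F64, 1}, {Ty::V16F32, 1}};
static const CostEntry AVX2Table[] = {{Ty::V4F64, 1}, {Ty::V8F32, 1}};
static const CostEntry SSE42Table[] = {
    {Ty::F64, 1}, {Ty::F32, 1}, {Ty::V2F64, 1}, {Ty::V4F32, 1}};

static bool isF32(Ty T) {
  return T == Ty::F32 || T == Ty::V4F32 || T == Ty::V8F32 || T == Ty::V16F32;
}

static int bitsOf(Ty T) {
  switch (T) {
  case Ty::F32: return 32;
  case Ty::F64: return 64;
  case Ty::V4F32: case Ty::V2F64: return 128;
  case Ty::V8F32: case Ty::V4F64: return 256;
  case Ty::V16F32: case Ty::V8F64: return 512;
  }
  return 0;
}

// Assumed legalization: split a too-wide vector into the widest legal pieces.
// Returns {number of pieces, legal type}, playing the role of LT in the diff.
static std::pair<int, Ty> legalize(Ty T, Level L) {
  int MaxBits = L >= Level::AVX512 ? 512 : (L >= Level::AVX ? 256 : 128);
  int Bits = bitsOf(T);
  if (Bits <= MaxBits)
    return {1, T};
  Ty Legal = MaxBits == 256 ? (isF32(T) ? Ty::V8F32 : Ty::V4F64)
                            : (isF32(T) ? Ty::V4F32 : Ty::V2F64);
  return {Bits / MaxBits, Legal};
}

// Linear search of one table; returns nullptr when the type has no entry.
template <std::size_t N>
static const CostEntry *lookup(const CostEntry (&Table)[N], Ty T) {
  for (const CostEntry &E : Table)
    if (E.Type == T)
      return &E;
  return nullptr;
}

// Try the newest feature tier first and fall through to older ones.
int fpArithCost(Ty T, Level L) {
  std::pair<int, Ty> LT = legalize(T, L);
  if (L >= Level::AVX512) {
    if (const CostEntry *E = lookup(AVX512Table, LT.second))
      return LT.first * E->Cost;
  }
  if (L >= Level::AVX2) {
    if (const CostEntry *E = lookup(AVX2Table, LT.second))
      return LT.first * E->Cost;
  }
  if (L >= Level::SSE42) {
    if (const CostEntry *E = lookup(SSE42Table, LT.second))
      return LT.first * E->Cost;
  }
  return LT.first * 2; // Assumed generic fallback cost for an FP op.
}
```

Under these assumptions, `fpArithCost(Ty::V8F32, Level::SSE42)` splits the vector into two v4f32 halves and returns 2, while `fpArithCost(Ty::V8F32, Level::AVX)` finds no entry for the legal v8f32 type and falls back to 2. Both match the updated expectations in arith-fp.ll.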

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon created this revision. (Feb 24 2018, 10:03 AM)

fhahn added a subscriber: fhahn. (Feb 24 2018, 10:06 AM)

spatel mentioned this in D43769: [TTI] rename getArithmeticInstructionCost() to getUnitThroughput(); NFC. (Feb 26 2018, 10:18 AM)

This revision is now accepted and ready to land. (Feb 26 2018, 11:19 AM)

Closed by commit rL326133: [X86][SSE] Reduce FADD/FSUB/FMUL costs on later targets (PR36280) (authored by RKSimon). (Feb 26 2018, 2:13 PM)

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path (base revision 326035)	Size

lib/Target/X86/X86TargetTransformInfo.cpp	30 lines
test/Analysis/CostModel/X86/arith-fp.ll	168 lines
test/Analysis/CostModel/X86/intrinsic-cost.ll	2 lines
test/Transforms/SLPVectorizer/X86/PR36280.ll	15 lines
test/Transforms/SLPVectorizer/X86/cse.ll	21 lines
test/Transforms/SLPVectorizer/X86/horizontal.ll	148 lines
test/Transforms/SLPVectorizer/X86/reorder_phi.ll	59 lines
test/Transforms/SLPVectorizer/X86/simplebb.ll	12 lines

Diff 135800

lib/Target/X86/X86TargetTransformInfo.cpp

[...]
static const CostTblEntry AVX512CostTable[] = {

{ ISD::MUL, MVT::v32i8, 13 }, // extend/pmullw/trunc sequence.		{ ISD::MUL, MVT::v32i8, 13 }, // extend/pmullw/trunc sequence.
{ ISD::MUL, MVT::v16i8, 5 }, // extend/pmullw/trunc sequence.		{ ISD::MUL, MVT::v16i8, 5 }, // extend/pmullw/trunc sequence.
{ ISD::MUL, MVT::v16i32, 1 }, // pmulld (Skylake from agner.org)		{ ISD::MUL, MVT::v16i32, 1 }, // pmulld (Skylake from agner.org)
{ ISD::MUL, MVT::v8i32, 1 }, // pmulld (Skylake from agner.org)		{ ISD::MUL, MVT::v8i32, 1 }, // pmulld (Skylake from agner.org)
{ ISD::MUL, MVT::v4i32, 1 }, // pmulld (Skylake from agner.org)		{ ISD::MUL, MVT::v4i32, 1 }, // pmulld (Skylake from agner.org)
{ ISD::MUL, MVT::v8i64, 8 }, // 3pmuludq/3shift/2*add		{ ISD::MUL, MVT::v8i64, 8 }, // 3pmuludq/3shift/2*add

		{ ISD::FADD, MVT::v8f64, 1 }, // Skylake from http://www.agner.org/
		{ ISD::FSUB, MVT::v8f64, 1 }, // Skylake from http://www.agner.org/
		{ ISD::FMUL, MVT::v8f64, 1 }, // Skylake from http://www.agner.org/

		{ ISD::FADD, MVT::v16f32, 1 }, // Skylake from http://www.agner.org/
		{ ISD::FSUB, MVT::v16f32, 1 }, // Skylake from http://www.agner.org/
		{ ISD::FMUL, MVT::v16f32, 1 }, // Skylake from http://www.agner.org/

// Vectorizing division is a bad idea. See the SSE2 table for more comments.		// Vectorizing division is a bad idea. See the SSE2 table for more comments.
{ ISD::SDIV, MVT::v16i32, 16*20 },		{ ISD::SDIV, MVT::v16i32, 16*20 },
{ ISD::SDIV, MVT::v8i64, 8*20 },		{ ISD::SDIV, MVT::v8i64, 8*20 },
{ ISD::UDIV, MVT::v16i32, 16*20 },		{ ISD::UDIV, MVT::v16i32, 16*20 },
{ ISD::UDIV, MVT::v8i64, 8*20 }		{ ISD::UDIV, MVT::v8i64, 8*20 }
};		};

if (ST->hasAVX512())		if (ST->hasAVX512())
[...]
static const CostTblEntry AVX2CostTable[] = {
{ ISD::ADD, MVT::v4i64, 1 }, // paddq		{ ISD::ADD, MVT::v4i64, 1 }, // paddq

{ ISD::MUL, MVT::v32i8, 17 }, // extend/pmullw/trunc sequence.		{ ISD::MUL, MVT::v32i8, 17 }, // extend/pmullw/trunc sequence.
{ ISD::MUL, MVT::v16i8, 7 }, // extend/pmullw/trunc sequence.		{ ISD::MUL, MVT::v16i8, 7 }, // extend/pmullw/trunc sequence.
{ ISD::MUL, MVT::v16i16, 1 }, // pmullw		{ ISD::MUL, MVT::v16i16, 1 }, // pmullw
{ ISD::MUL, MVT::v8i32, 2 }, // pmulld (Haswell from agner.org)		{ ISD::MUL, MVT::v8i32, 2 }, // pmulld (Haswell from agner.org)
{ ISD::MUL, MVT::v4i64, 8 }, // 3pmuludq/3shift/2*add		{ ISD::MUL, MVT::v4i64, 8 }, // 3pmuludq/3shift/2*add

		{ ISD::FADD, MVT::v4f64, 1 }, // Haswell from http://www.agner.org/
		{ ISD::FADD, MVT::v8f32, 1 }, // Haswell from http://www.agner.org/
		{ ISD::FSUB, MVT::v4f64, 1 }, // Haswell from http://www.agner.org/
		{ ISD::FSUB, MVT::v8f32, 1 }, // Haswell from http://www.agner.org/
		{ ISD::FMUL, MVT::v4f64, 1 }, // Haswell from http://www.agner.org/
		{ ISD::FMUL, MVT::v8f32, 1 }, // Haswell from http://www.agner.org/

{ ISD::FDIV, MVT::f32, 7 }, // Haswell from http://www.agner.org/		{ ISD::FDIV, MVT::f32, 7 }, // Haswell from http://www.agner.org/
{ ISD::FDIV, MVT::v4f32, 7 }, // Haswell from http://www.agner.org/		{ ISD::FDIV, MVT::v4f32, 7 }, // Haswell from http://www.agner.org/
{ ISD::FDIV, MVT::v8f32, 14 }, // Haswell from http://www.agner.org/		{ ISD::FDIV, MVT::v8f32, 14 }, // Haswell from http://www.agner.org/
{ ISD::FDIV, MVT::f64, 14 }, // Haswell from http://www.agner.org/		{ ISD::FDIV, MVT::f64, 14 }, // Haswell from http://www.agner.org/
{ ISD::FDIV, MVT::v2f64, 14 }, // Haswell from http://www.agner.org/		{ ISD::FDIV, MVT::v2f64, 14 }, // Haswell from http://www.agner.org/
{ ISD::FDIV, MVT::v4f64, 28 }, // Haswell from http://www.agner.org/		{ ISD::FDIV, MVT::v4f64, 28 }, // Haswell from http://www.agner.org/
};		};

[...]
static const CostTblEntry AVX1CostTable[] = {
{ ISD::UDIV, MVT::v4i64, 4*20 },		{ ISD::UDIV, MVT::v4i64, 4*20 },
};		};

if (ST->hasAVX())		if (ST->hasAVX())
if (const auto *Entry = CostTableLookup(AVX1CostTable, ISD, LT.second))		if (const auto *Entry = CostTableLookup(AVX1CostTable, ISD, LT.second))
return LT.first * Entry->Cost;		return LT.first * Entry->Cost;

static const CostTblEntry SSE42CostTable[] = {		static const CostTblEntry SSE42CostTable[] = {
		{ ISD::FADD, MVT::f64, 1 }, // Nehalem from http://www.agner.org/
		{ ISD::FADD, MVT::f32, 1 }, // Nehalem from http://www.agner.org/
		{ ISD::FADD, MVT::v2f64, 1 }, // Nehalem from http://www.agner.org/
		{ ISD::FADD, MVT::v4f32, 1 }, // Nehalem from http://www.agner.org/

		{ ISD::FSUB, MVT::f64, 1 }, // Nehalem from http://www.agner.org/
		{ ISD::FSUB, MVT::f32 , 1 }, // Nehalem from http://www.agner.org/
		{ ISD::FSUB, MVT::v2f64, 1 }, // Nehalem from http://www.agner.org/
		{ ISD::FSUB, MVT::v4f32, 1 }, // Nehalem from http://www.agner.org/

		{ ISD::FMUL, MVT::f64, 1 }, // Nehalem from http://www.agner.org/
		{ ISD::FMUL, MVT::f32, 1 }, // Nehalem from http://www.agner.org/
		{ ISD::FMUL, MVT::v2f64, 1 }, // Nehalem from http://www.agner.org/
		{ ISD::FMUL, MVT::v4f32, 1 }, // Nehalem from http://www.agner.org/

{ ISD::FDIV, MVT::f32, 14 }, // Nehalem from http://www.agner.org/		{ ISD::FDIV, MVT::f32, 14 }, // Nehalem from http://www.agner.org/
{ ISD::FDIV, MVT::v4f32, 14 }, // Nehalem from http://www.agner.org/		{ ISD::FDIV, MVT::v4f32, 14 }, // Nehalem from http://www.agner.org/
{ ISD::FDIV, MVT::f64, 22 }, // Nehalem from http://www.agner.org/		{ ISD::FDIV, MVT::f64, 22 }, // Nehalem from http://www.agner.org/
{ ISD::FDIV, MVT::v2f64, 22 }, // Nehalem from http://www.agner.org/		{ ISD::FDIV, MVT::v2f64, 22 }, // Nehalem from http://www.agner.org/
};		};

if (ST->hasSSE42())		if (ST->hasSSE42())
if (const auto *Entry = CostTableLookup(SSE42CostTable, ISD, LT.second))		if (const auto *Entry = CostTableLookup(SSE42CostTable, ISD, LT.second))
[...]

test/Analysis/CostModel/X86/arith-fp.ll

	; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+sse2 | FileCheck %s --check-prefix=CHECK --check-prefix=SSE2			; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+sse2 | FileCheck %s --check-prefix=CHECK --check-prefix=SSE2
	; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+sse4.2 | FileCheck %s --check-prefix=CHECK --check-prefix=SSE42			; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+sse4.2 | FileCheck %s --check-prefix=CHECK --check-prefix=SSE42
	; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx,+fma | FileCheck %s --check-prefix=CHECK --check-prefix=AVX			; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx,+fma | FileCheck %s --check-prefix=CHECK --check-prefix=AVX
	; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx2,+fma | FileCheck %s --check-prefix=CHECK --check-prefix=AVX2			; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx2,+fma | FileCheck %s --check-prefix=CHECK --check-prefix=AVX2
	; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx512f | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512F			; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx512f | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512F
	; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx512f,+avx512bw | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512BW			; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx512f,+avx512bw | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512BW

	target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"			target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
	target triple = "x86_64-apple-macosx10.8.0"			target triple = "x86_64-apple-macosx10.8.0"

	; CHECK-LABEL: 'fadd'			; CHECK-LABEL: 'fadd'
	define i32 @fadd(i32 %arg) {			define i32 @fadd(i32 %arg) {
	; SSE2: cost of 2 {{.*}} %F32 = fadd			; SSE2: cost of 2 {{.*}} %F32 = fadd
	; SSE42: cost of 2 {{.*}} %F32 = fadd			; SSE42: cost of 1 {{.*}} %F32 = fadd
	; AVX: cost of 2 {{.*}} %F32 = fadd			; AVX: cost of 1 {{.*}} %F32 = fadd
	; AVX2: cost of 2 {{.*}} %F32 = fadd			; AVX2: cost of 1 {{.*}} %F32 = fadd
	; AVX512: cost of 2 {{.*}} %F32 = fadd			; AVX512: cost of 1 {{.*}} %F32 = fadd
	%F32 = fadd float undef, undef			%F32 = fadd float undef, undef
	; SSE2: cost of 2 {{.*}} %V4F32 = fadd			; SSE2: cost of 2 {{.*}} %V4F32 = fadd
	; SSE42: cost of 2 {{.*}} %V4F32 = fadd			; SSE42: cost of 1 {{.*}} %V4F32 = fadd
	; AVX: cost of 2 {{.*}} %V4F32 = fadd			; AVX: cost of 1 {{.*}} %V4F32 = fadd
	; AVX2: cost of 2 {{.*}} %V4F32 = fadd			; AVX2: cost of 1 {{.*}} %V4F32 = fadd
	; AVX512: cost of 2 {{.*}} %V4F32 = fadd			; AVX512: cost of 1 {{.*}} %V4F32 = fadd
	%V4F32 = fadd <4 x float> undef, undef			%V4F32 = fadd <4 x float> undef, undef
	; SSE2: cost of 4 {{.*}} %V8F32 = fadd			; SSE2: cost of 4 {{.*}} %V8F32 = fadd
	; SSE42: cost of 4 {{.*}} %V8F32 = fadd			; SSE42: cost of 2 {{.*}} %V8F32 = fadd
	; AVX: cost of 2 {{.*}} %V8F32 = fadd			; AVX: cost of 2 {{.*}} %V8F32 = fadd
	; AVX2: cost of 2 {{.*}} %V8F32 = fadd			; AVX2: cost of 1 {{.*}} %V8F32 = fadd
	; AVX512: cost of 2 {{.*}} %V8F32 = fadd			; AVX512: cost of 1 {{.*}} %V8F32 = fadd
	%V8F32 = fadd <8 x float> undef, undef			%V8F32 = fadd <8 x float> undef, undef
	; SSE2: cost of 8 {{.*}} %V16F32 = fadd			; SSE2: cost of 8 {{.*}} %V16F32 = fadd
	; SSE42: cost of 8 {{.*}} %V16F32 = fadd			; SSE42: cost of 4 {{.*}} %V16F32 = fadd
	; AVX: cost of 4 {{.*}} %V16F32 = fadd			; AVX: cost of 4 {{.*}} %V16F32 = fadd
	; AVX2: cost of 4 {{.*}} %V16F32 = fadd			; AVX2: cost of 2 {{.*}} %V16F32 = fadd
	; AVX512: cost of 2 {{.*}} %V16F32 = fadd			; AVX512: cost of 1 {{.*}} %V16F32 = fadd
	%V16F32 = fadd <16 x float> undef, undef			%V16F32 = fadd <16 x float> undef, undef

	; SSE2: cost of 2 {{.*}} %F64 = fadd			; SSE2: cost of 2 {{.*}} %F64 = fadd
	; SSE42: cost of 2 {{.*}} %F64 = fadd			; SSE42: cost of 1 {{.*}} %F64 = fadd
	; AVX: cost of 2 {{.*}} %F64 = fadd			; AVX: cost of 1 {{.*}} %F64 = fadd
	; AVX2: cost of 2 {{.*}} %F64 = fadd			; AVX2: cost of 1 {{.*}} %F64 = fadd
	; AVX512: cost of 2 {{.*}} %F64 = fadd			; AVX512: cost of 1 {{.*}} %F64 = fadd
	%F64 = fadd double undef, undef			%F64 = fadd double undef, undef
	; SSE2: cost of 2 {{.*}} %V2F64 = fadd			; SSE2: cost of 2 {{.*}} %V2F64 = fadd
	; SSE42: cost of 2 {{.*}} %V2F64 = fadd			; SSE42: cost of 1 {{.*}} %V2F64 = fadd
	; AVX: cost of 2 {{.*}} %V2F64 = fadd			; AVX: cost of 1 {{.*}} %V2F64 = fadd
	; AVX2: cost of 2 {{.*}} %V2F64 = fadd			; AVX2: cost of 1 {{.*}} %V2F64 = fadd
	; AVX512: cost of 2 {{.*}} %V2F64 = fadd			; AVX512: cost of 1 {{.*}} %V2F64 = fadd
	%V2F64 = fadd <2 x double> undef, undef			%V2F64 = fadd <2 x double> undef, undef
	; SSE2: cost of 4 {{.*}} %V4F64 = fadd			; SSE2: cost of 4 {{.*}} %V4F64 = fadd
	; SSE42: cost of 4 {{.*}} %V4F64 = fadd			; SSE42: cost of 2 {{.*}} %V4F64 = fadd
	; AVX: cost of 2 {{.*}} %V4F64 = fadd			; AVX: cost of 2 {{.*}} %V4F64 = fadd
	; AVX2: cost of 2 {{.*}} %V4F64 = fadd			; AVX2: cost of 1 {{.*}} %V4F64 = fadd
	; AVX512: cost of 2 {{.*}} %V4F64 = fadd			; AVX512: cost of 1 {{.*}} %V4F64 = fadd
	%V4F64 = fadd <4 x double> undef, undef			%V4F64 = fadd <4 x double> undef, undef
	; SSE2: cost of 8 {{.*}} %V8F64 = fadd			; SSE2: cost of 8 {{.*}} %V8F64 = fadd
	; SSE42: cost of 8 {{.*}} %V8F64 = fadd			; SSE42: cost of 4 {{.*}} %V8F64 = fadd
	; AVX: cost of 4 {{.*}} %V8F64 = fadd			; AVX: cost of 4 {{.*}} %V8F64 = fadd
	; AVX2: cost of 4 {{.*}} %V8F64 = fadd			; AVX2: cost of 2 {{.*}} %V8F64 = fadd
	; AVX512: cost of 2 {{.*}} %V8F64 = fadd			; AVX512: cost of 1 {{.*}} %V8F64 = fadd
	%V8F64 = fadd <8 x double> undef, undef			%V8F64 = fadd <8 x double> undef, undef

	ret i32 undef			ret i32 undef
	}			}

	; CHECK-LABEL: 'fsub'			; CHECK-LABEL: 'fsub'
	define i32 @fsub(i32 %arg) {			define i32 @fsub(i32 %arg) {
	; SSE2: cost of 2 {{.*}} %F32 = fsub			; SSE2: cost of 2 {{.*}} %F32 = fsub
	; SSE42: cost of 2 {{.*}} %F32 = fsub			; SSE42: cost of 1 {{.*}} %F32 = fsub
	; AVX: cost of 2 {{.*}} %F32 = fsub			; AVX: cost of 1 {{.*}} %F32 = fsub
	; AVX2: cost of 2 {{.*}} %F32 = fsub			; AVX2: cost of 1 {{.*}} %F32 = fsub
	; AVX512: cost of 2 {{.*}} %F32 = fsub			; AVX512: cost of 1 {{.*}} %F32 = fsub
	%F32 = fsub float undef, undef			%F32 = fsub float undef, undef
	; SSE2: cost of 2 {{.*}} %V4F32 = fsub			; SSE2: cost of 2 {{.*}} %V4F32 = fsub
	; SSE42: cost of 2 {{.*}} %V4F32 = fsub			; SSE42: cost of 1 {{.*}} %V4F32 = fsub
	; AVX: cost of 2 {{.*}} %V4F32 = fsub			; AVX: cost of 1 {{.*}} %V4F32 = fsub
	; AVX2: cost of 2 {{.*}} %V4F32 = fsub			; AVX2: cost of 1 {{.*}} %V4F32 = fsub
	; AVX512: cost of 2 {{.*}} %V4F32 = fsub			; AVX512: cost of 1 {{.*}} %V4F32 = fsub
	%V4F32 = fsub <4 x float> undef, undef			%V4F32 = fsub <4 x float> undef, undef
	; SSE2: cost of 4 {{.*}} %V8F32 = fsub			; SSE2: cost of 4 {{.*}} %V8F32 = fsub
	; SSE42: cost of 4 {{.*}} %V8F32 = fsub			; SSE42: cost of 2 {{.*}} %V8F32 = fsub
	; AVX: cost of 2 {{.*}} %V8F32 = fsub			; AVX: cost of 2 {{.*}} %V8F32 = fsub
	; AVX2: cost of 2 {{.*}} %V8F32 = fsub			; AVX2: cost of 1 {{.*}} %V8F32 = fsub
	; AVX512: cost of 2 {{.*}} %V8F32 = fsub			; AVX512: cost of 1 {{.*}} %V8F32 = fsub
	%V8F32 = fsub <8 x float> undef, undef			%V8F32 = fsub <8 x float> undef, undef
	; SSE2: cost of 8 {{.*}} %V16F32 = fsub			; SSE2: cost of 8 {{.*}} %V16F32 = fsub
	; SSE42: cost of 8 {{.*}} %V16F32 = fsub			; SSE42: cost of 4 {{.*}} %V16F32 = fsub
	; AVX: cost of 4 {{.*}} %V16F32 = fsub			; AVX: cost of 4 {{.*}} %V16F32 = fsub
	; AVX2: cost of 4 {{.*}} %V16F32 = fsub			; AVX2: cost of 2 {{.*}} %V16F32 = fsub
	; AVX512: cost of 2 {{.*}} %V16F32 = fsub			; AVX512: cost of 1 {{.*}} %V16F32 = fsub
	%V16F32 = fsub <16 x float> undef, undef			%V16F32 = fsub <16 x float> undef, undef

	; SSE2: cost of 2 {{.*}} %F64 = fsub			; SSE2: cost of 2 {{.*}} %F64 = fsub
	; SSE42: cost of 2 {{.*}} %F64 = fsub			; SSE42: cost of 1 {{.*}} %F64 = fsub
	; AVX: cost of 2 {{.*}} %F64 = fsub			; AVX: cost of 1 {{.*}} %F64 = fsub
	; AVX2: cost of 2 {{.*}} %F64 = fsub			; AVX2: cost of 1 {{.*}} %F64 = fsub
	; AVX512: cost of 2 {{.*}} %F64 = fsub			; AVX512: cost of 1 {{.*}} %F64 = fsub
	%F64 = fsub double undef, undef			%F64 = fsub double undef, undef
	; SSE2: cost of 2 {{.*}} %V2F64 = fsub			; SSE2: cost of 2 {{.*}} %V2F64 = fsub
	; SSE42: cost of 2 {{.*}} %V2F64 = fsub			; SSE42: cost of 1 {{.*}} %V2F64 = fsub
	; AVX: cost of 2 {{.*}} %V2F64 = fsub			; AVX: cost of 1 {{.*}} %V2F64 = fsub
	; AVX2: cost of 2 {{.*}} %V2F64 = fsub			; AVX2: cost of 1 {{.*}} %V2F64 = fsub
	; AVX512: cost of 2 {{.*}} %V2F64 = fsub			; AVX512: cost of 1 {{.*}} %V2F64 = fsub
	%V2F64 = fsub <2 x double> undef, undef			%V2F64 = fsub <2 x double> undef, undef
	; SSE2: cost of 4 {{.*}} %V4F64 = fsub			; SSE2: cost of 4 {{.*}} %V4F64 = fsub
	; SSE42: cost of 4 {{.*}} %V4F64 = fsub			; SSE42: cost of 2 {{.*}} %V4F64 = fsub
	; AVX: cost of 2 {{.*}} %V4F64 = fsub			; AVX: cost of 2 {{.*}} %V4F64 = fsub
	; AVX2: cost of 2 {{.*}} %V4F64 = fsub			; AVX2: cost of 1 {{.*}} %V4F64 = fsub
	; AVX512: cost of 2 {{.*}} %V4F64 = fsub			; AVX512: cost of 1 {{.*}} %V4F64 = fsub
	%V4F64 = fsub <4 x double> undef, undef			%V4F64 = fsub <4 x double> undef, undef
	; SSE2: cost of 8 {{.*}} %V8F64 = fsub			; SSE2: cost of 8 {{.*}} %V8F64 = fsub
	; SSE42: cost of 8 {{.*}} %V8F64 = fsub			; SSE42: cost of 4 {{.*}} %V8F64 = fsub
	; AVX: cost of 4 {{.*}} %V8F64 = fsub			; AVX: cost of 4 {{.*}} %V8F64 = fsub
	; AVX2: cost of 4 {{.*}} %V8F64 = fsub			; AVX2: cost of 2 {{.*}} %V8F64 = fsub
	; AVX512: cost of 2 {{.*}} %V8F64 = fsub			; AVX512: cost of 1 {{.*}} %V8F64 = fsub
	%V8F64 = fsub <8 x double> undef, undef			%V8F64 = fsub <8 x double> undef, undef

	ret i32 undef			ret i32 undef
	}			}

	; CHECK-LABEL: 'fmul'			; CHECK-LABEL: 'fmul'
	define i32 @fmul(i32 %arg) {			define i32 @fmul(i32 %arg) {
	; SSE2: cost of 2 {{.*}} %F32 = fmul			; SSE2: cost of 2 {{.*}} %F32 = fmul
	; SSE42: cost of 2 {{.*}} %F32 = fmul			; SSE42: cost of 1 {{.*}} %F32 = fmul
	; AVX: cost of 2 {{.*}} %F32 = fmul			; AVX: cost of 1 {{.*}} %F32 = fmul
	; AVX2: cost of 2 {{.*}} %F32 = fmul			; AVX2: cost of 1 {{.*}} %F32 = fmul
	; AVX512: cost of 2 {{.*}} %F32 = fmul			; AVX512: cost of 1 {{.*}} %F32 = fmul
	%F32 = fmul float undef, undef			%F32 = fmul float undef, undef
	; SSE2: cost of 2 {{.*}} %V4F32 = fmul			; SSE2: cost of 2 {{.*}} %V4F32 = fmul
	; SSE42: cost of 2 {{.*}} %V4F32 = fmul			; SSE42: cost of 1 {{.*}} %V4F32 = fmul
	; AVX: cost of 2 {{.*}} %V4F32 = fmul			; AVX: cost of 1 {{.*}} %V4F32 = fmul
	; AVX2: cost of 2 {{.*}} %V4F32 = fmul			; AVX2: cost of 1 {{.*}} %V4F32 = fmul
	; AVX512: cost of 2 {{.*}} %V4F32 = fmul			; AVX512: cost of 1 {{.*}} %V4F32 = fmul
	%V4F32 = fmul <4 x float> undef, undef			%V4F32 = fmul <4 x float> undef, undef
	; SSE2: cost of 4 {{.*}} %V8F32 = fmul			; SSE2: cost of 4 {{.*}} %V8F32 = fmul
	; SSE42: cost of 4 {{.*}} %V8F32 = fmul			; SSE42: cost of 2 {{.*}} %V8F32 = fmul
	; AVX: cost of 2 {{.*}} %V8F32 = fmul			; AVX: cost of 2 {{.*}} %V8F32 = fmul
	; AVX2: cost of 2 {{.*}} %V8F32 = fmul			; AVX2: cost of 1 {{.*}} %V8F32 = fmul
	; AVX512: cost of 2 {{.*}} %V8F32 = fmul			; AVX512: cost of 1 {{.*}} %V8F32 = fmul
	%V8F32 = fmul <8 x float> undef, undef			%V8F32 = fmul <8 x float> undef, undef
	; SSE2: cost of 8 {{.*}} %V16F32 = fmul			; SSE2: cost of 8 {{.*}} %V16F32 = fmul
	; SSE42: cost of 8 {{.*}} %V16F32 = fmul			; SSE42: cost of 4 {{.*}} %V16F32 = fmul
	; AVX: cost of 4 {{.*}} %V16F32 = fmul			; AVX: cost of 4 {{.*}} %V16F32 = fmul
	; AVX2: cost of 4 {{.*}} %V16F32 = fmul			; AVX2: cost of 2 {{.*}} %V16F32 = fmul
	; AVX512: cost of 2 {{.*}} %V16F32 = fmul			; AVX512: cost of 1 {{.*}} %V16F32 = fmul
	%V16F32 = fmul <16 x float> undef, undef			%V16F32 = fmul <16 x float> undef, undef

	; SSE2: cost of 2 {{.*}} %F64 = fmul			; SSE2: cost of 2 {{.*}} %F64 = fmul
	; SSE42: cost of 2 {{.*}} %F64 = fmul			; SSE42: cost of 1 {{.*}} %F64 = fmul
	; AVX: cost of 2 {{.*}} %F64 = fmul			; AVX: cost of 1 {{.*}} %F64 = fmul
	; AVX2: cost of 2 {{.*}} %F64 = fmul			; AVX2: cost of 1 {{.*}} %F64 = fmul
	; AVX512: cost of 2 {{.*}} %F64 = fmul			; AVX512: cost of 1 {{.*}} %F64 = fmul
	%F64 = fmul double undef, undef			%F64 = fmul double undef, undef
	; SSE2: cost of 2 {{.*}} %V2F64 = fmul			; SSE2: cost of 2 {{.*}} %V2F64 = fmul
	; SSE42: cost of 2 {{.*}} %V2F64 = fmul			; SSE42: cost of 1 {{.*}} %V2F64 = fmul
	; AVX: cost of 2 {{.*}} %V2F64 = fmul			; AVX: cost of 1 {{.*}} %V2F64 = fmul
	; AVX2: cost of 2 {{.*}} %V2F64 = fmul			; AVX2: cost of 1 {{.*}} %V2F64 = fmul
	; AVX512: cost of 2 {{.*}} %V2F64 = fmul			; AVX512: cost of 1 {{.*}} %V2F64 = fmul
	%V2F64 = fmul <2 x double> undef, undef			%V2F64 = fmul <2 x double> undef, undef
	; SSE2: cost of 4 {{.*}} %V4F64 = fmul			; SSE2: cost of 4 {{.*}} %V4F64 = fmul
	; SSE42: cost of 4 {{.*}} %V4F64 = fmul			; SSE42: cost of 2 {{.*}} %V4F64 = fmul
	; AVX: cost of 2 {{.*}} %V4F64 = fmul			; AVX: cost of 2 {{.*}} %V4F64 = fmul
	; AVX2: cost of 2 {{.*}} %V4F64 = fmul			; AVX2: cost of 1 {{.*}} %V4F64 = fmul
	; AVX512: cost of 2 {{.*}} %V4F64 = fmul			; AVX512: cost of 1 {{.*}} %V4F64 = fmul
	%V4F64 = fmul <4 x double> undef, undef			%V4F64 = fmul <4 x double> undef, undef
	; SSE2: cost of 8 {{.*}} %V8F64 = fmul			; SSE2: cost of 8 {{.*}} %V8F64 = fmul
	; SSE42: cost of 8 {{.*}} %V8F64 = fmul			; SSE42: cost of 4 {{.*}} %V8F64 = fmul
	; AVX: cost of 4 {{.*}} %V8F64 = fmul			; AVX: cost of 4 {{.*}} %V8F64 = fmul
	; AVX2: cost of 4 {{.*}} %V8F64 = fmul			; AVX2: cost of 2 {{.*}} %V8F64 = fmul
	; AVX512: cost of 2 {{.*}} %V8F64 = fmul			; AVX512: cost of 1 {{.*}} %V8F64 = fmul
	%V8F64 = fmul <8 x double> undef, undef			%V8F64 = fmul <8 x double> undef, undef

	ret i32 undef			ret i32 undef
	}			}

	; CHECK-LABEL: 'fdiv'			; CHECK-LABEL: 'fdiv'
	define i32 @fdiv(i32 %arg) {			define i32 @fdiv(i32 %arg) {
	; SSE2: cost of 23 {{.*}} %F32 = fdiv			; SSE2: cost of 23 {{.*}} %F32 = fdiv
	[...]

test/Analysis/CostModel/X86/intrinsic-cost.ll

	[...]

	for.end: ; preds = %vector.body			for.end: ; preds = %vector.body
	ret void			ret void

	; CORE2: Printing analysis 'Cost Model Analysis' for function 'test3':			; CORE2: Printing analysis 'Cost Model Analysis' for function 'test3':
	; CORE2: Cost Model: Found an estimated cost of 4 for instruction: %2 = call <4 x float> @llvm.fmuladd.v4f32(<4 x float> %wide.load, <4 x float> %b, <4 x float> %c)			; CORE2: Cost Model: Found an estimated cost of 4 for instruction: %2 = call <4 x float> @llvm.fmuladd.v4f32(<4 x float> %wide.load, <4 x float> %b, <4 x float> %c)

	; COREI7: Printing analysis 'Cost Model Analysis' for function 'test3':			; COREI7: Printing analysis 'Cost Model Analysis' for function 'test3':
	; COREI7: Cost Model: Found an estimated cost of 4 for instruction: %2 = call <4 x float> @llvm.fmuladd.v4f32(<4 x float> %wide.load, <4 x float> %b, <4 x float> %c)			; COREI7: Cost Model: Found an estimated cost of 2 for instruction: %2 = call <4 x float> @llvm.fmuladd.v4f32(<4 x float> %wide.load, <4 x float> %b, <4 x float> %c)

	}			}

	declare <4 x float> @llvm.fmuladd.v4f32(<4 x float>, <4 x float>, <4 x float>) nounwind readnone			declare <4 x float> @llvm.fmuladd.v4f32(<4 x float>, <4 x float>, <4 x float>) nounwind readnone
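
The COREI7 change from 4 to 2 is consistent with an unfused `llvm.fmuladd` being priced as one fmul plus one fadd of the same type. That pricing rule is an assumption about the generic fallback, which is not part of this diff, and the helper name below is hypothetical. A minimal sketch of the arithmetic:

```cpp
#include <cassert>

// Hypothetical helper: assumed cost of an fmuladd that is not lowered to a
// fused multiply-add, modelled as a separate multiply followed by an add.
constexpr int unfusedFMulAddCost(int FMulCost, int FAddCost) {
  return FMulCost + FAddCost;
}

// CORE2 (no SSE4.2): v4f32 fmul and fadd keep their old cost of 2 each.
static_assert(unfusedFMulAddCost(2, 2) == 4, "matches the CORE2 check line");
// COREI7 (SSE4.2): both now cost 1 via the new SSE42CostTable entries.
static_assert(unfusedFMulAddCost(1, 1) == 2, "matches the COREI7 check line");
```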

test/Transforms/SLPVectorizer/X86/PR36280.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=bdver2 | FileCheck %s			; RUN: opt -slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=bdver2 | FileCheck %s

	define float @jacobi(float* %p, float %x, float %y, float %z) {			define float @jacobi(float* %p, float %x, float %y, float %z) {
	; CHECK-LABEL: @jacobi(			; CHECK-LABEL: @jacobi(
	; CHECK-NEXT: [[GEP1:%.*]] = getelementptr float, float* [[P:%.*]], i64 1			; CHECK-NEXT: [[GEP1:%.*]] = getelementptr float, float* [[P:%.*]], i64 1
	; CHECK-NEXT: [[GEP2:%.*]] = getelementptr float, float* [[P]], i64 2			; CHECK-NEXT: [[GEP2:%.*]] = getelementptr float, float* [[P]], i64 2
	; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[GEP1]] to <2 x float>*			; CHECK-NEXT: [[P1:%.*]] = load float, float* [[GEP1]]
	; CHECK-NEXT: [[TMP2:%.*]] = load <2 x float>, <2 x float>* [[TMP1]], align 4			; CHECK-NEXT: [[P2:%.*]] = load float, float* [[GEP2]]
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x float> undef, float [[X:%.*]], i32 0			; CHECK-NEXT: [[MUL1:%.*]] = fmul float [[P1]], [[X:%.*]]
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x float> [[TMP3]], float [[Y:%.*]], i32 1			; CHECK-NEXT: [[MUL2:%.*]] = fmul float [[P2]], [[Y:%.*]]
	; CHECK-NEXT: [[TMP5:%.*]] = fmul <2 x float> [[TMP4]], [[TMP2]]			; CHECK-NEXT: [[ADD1:%.*]] = fadd float [[MUL1]], [[Z:%.*]]
	; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x float> [[TMP5]], i32 0			; CHECK-NEXT: [[ADD2:%.*]] = fadd float [[MUL2]], [[ADD1]]
	; CHECK-NEXT: [[ADD1:%.*]] = fadd float [[TMP6]], [[Z:%.*]]
	; CHECK-NEXT: [[TMP7:%.*]] = extractelement <2 x float> [[TMP5]], i32 1
	; CHECK-NEXT: [[ADD2:%.*]] = fadd float [[TMP7]], [[ADD1]]
	; CHECK-NEXT: ret float [[ADD2]]			; CHECK-NEXT: ret float [[ADD2]]
	;			;
	%gep1 = getelementptr float, float* %p, i64 1			%gep1 = getelementptr float, float* %p, i64 1
	%gep2 = getelementptr float, float* %p, i64 2			%gep2 = getelementptr float, float* %p, i64 2
	%p1 = load float, float* %gep1			%p1 = load float, float* %gep1
	%p2 = load float, float* %gep2			%p2 = load float, float* %gep2
	%mul1 = fmul float %p1, %x			%mul1 = fmul float %p1, %x
	%mul2 = fmul float %p2, %y			%mul2 = fmul float %p2, %y
	%add1 = fadd float %mul1, %z			%add1 = fadd float %mul1, %z
	%add2 = fadd float %mul2, %add1			%add2 = fadd float %mul2, %add1
	ret float %add2			ret float %add2
	}			}

test/Transforms/SLPVectorizer/X86/cse.ll

	[...]
	define i32 @test(double* nocapture %G) {			define i32 @test(double* nocapture %G) {
	; CHECK-LABEL: @test(			; CHECK-LABEL: @test(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds double, double* [[G:%.*]], i64 5			; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds double, double* [[G:%.*]], i64 5
	; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds double, double* [[G]], i64 6			; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds double, double* [[G]], i64 6
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[ARRAYIDX]] to <2 x double>*			; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[ARRAYIDX]] to <2 x double>*
	; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8			; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8
	; CHECK-NEXT: [[TMP2:%.*]] = fmul <2 x double> <double 4.000000e+00, double 3.000000e+00>, [[TMP1]]			; CHECK-NEXT: [[TMP2:%.*]] = fmul <2 x double> <double 4.000000e+00, double 3.000000e+00>, [[TMP1]]
				; CHECK-NEXT: [[TMP3:%.*]] = fadd <2 x double> <double 1.000000e+00, double 6.000000e+00>, [[TMP2]]
	; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds double, double* [[G]], i64 1			; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds double, double* [[G]], i64 1
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast double* [[G]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP3]], <2 x double>* [[TMP4]], align 8
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x double> [[TMP2]], i32 0
				; CHECK-NEXT: [[ADD8:%.*]] = fadd double [[TMP5]], 7.000000e+00
	; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds double, double* [[G]], i64 2			; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds double, double* [[G]], i64 2
	; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x double> [[TMP1]], i32 1			; CHECK-NEXT: store double [[ADD8]], double* [[ARRAYIDX9]], align 8
	; CHECK-NEXT: [[MUL11:%.*]] = fmul double [[TMP3]], 4.000000e+00			; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x double> [[TMP1]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x double> [[TMP2]], i32 0			; CHECK-NEXT: [[MUL11:%.*]] = fmul double [[TMP6]], 4.000000e+00
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x double> undef, double [[TMP4]], i32 0			; CHECK-NEXT: [[ADD12:%.*]] = fadd double [[MUL11]], 8.000000e+00
	; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x double> [[TMP2]], i32 1
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x double> [[TMP5]], double [[TMP6]], i32 1
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x double> [[TMP7]], double [[TMP4]], i32 2
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x double> [[TMP8]], double [[MUL11]], i32 3
	; CHECK-NEXT: [[TMP10:%.*]] = fadd <4 x double> <double 1.000000e+00, double 6.000000e+00, double 7.000000e+00, double 8.000000e+00>, [[TMP9]]
	; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds double, double* [[G]], i64 3			; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds double, double* [[G]], i64 3
	; CHECK-NEXT: [[TMP11:%.*]] = bitcast double* [[G]] to <4 x double>*			; CHECK-NEXT: store double [[ADD12]], double* [[ARRAYIDX13]], align 8
	; CHECK-NEXT: store <4 x double> [[TMP10]], <4 x double>* [[TMP11]], align 8
	; CHECK-NEXT: ret i32 undef			; CHECK-NEXT: ret i32 undef
	;			;
	entry:			entry:
	%arrayidx = getelementptr inbounds double, double* %G, i64 5			%arrayidx = getelementptr inbounds double, double* %G, i64 5
	%0 = load double, double* %arrayidx, align 8			%0 = load double, double* %arrayidx, align 8
	%mul = fmul double %0, 4.000000e+00			%mul = fmul double %0, 4.000000e+00
	%add = fadd double %mul, 1.000000e+00			%add = fadd double %mul, 1.000000e+00
	store double %add, double* %G, align 8			store double %add, double* %G, align 8
	[...]

test/Transforms/SLPVectorizer/X86/horizontal.ll

	[...]
	; CHECK-NEXT: [[CMP1495:%.*]] = icmp eq i32 [[ARG_B:%.*]], 0			; CHECK-NEXT: [[CMP1495:%.*]] = icmp eq i32 [[ARG_B:%.*]], 0
	; CHECK-NEXT: br label [[FOR_BODY:%.*]]			; CHECK-NEXT: br label [[FOR_BODY:%.*]]
	; CHECK: for.cond.cleanup:			; CHECK: for.cond.cleanup:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	; CHECK: for.body:			; CHECK: for.body:
	; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_COND_CLEANUP15:%.*]] ]			; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_COND_CLEANUP15:%.*]] ]
	; CHECK-NEXT: [[TMP0:%.*]] = shl i64 [[INDVARS_IV]], 2			; CHECK-NEXT: [[TMP0:%.*]] = shl i64 [[INDVARS_IV]], 2
	; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[ARRAY:%.*]], i64 [[TMP0]]			; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[ARRAY:%.*]], i64 [[TMP0]]
	; CHECK-NEXT: [[TMP1:%.*]] = or i64 [[TMP0]], 1			; CHECK-NEXT: [[TMP1:%.*]] = load float, float* [[ARRAYIDX]], align 4
	; CHECK-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP1]]			; CHECK-NEXT: [[TMP2:%.*]] = or i64 [[TMP0]], 1
	; CHECK-NEXT: [[TMP2:%.*]] = or i64 [[TMP0]], 2			; CHECK-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP2]]
	; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP2]]			; CHECK-NEXT: [[TMP3:%.*]] = load float, float* [[ARRAYIDX4]], align 4
	; CHECK-NEXT: [[TMP3:%.*]] = or i64 [[TMP0]], 3			; CHECK-NEXT: [[TMP4:%.*]] = or i64 [[TMP0]], 2
	; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP3]]			; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP4]]
	; CHECK-NEXT: [[TMP4:%.*]] = bitcast float* [[ARRAYIDX]] to <4 x float>*			; CHECK-NEXT: [[TMP5:%.*]] = load float, float* [[ARRAYIDX8]], align 4
	; CHECK-NEXT: [[TMP5:%.*]] = load <4 x float>, <4 x float>* [[TMP4]], align 4			; CHECK-NEXT: [[TMP6:%.*]] = or i64 [[TMP0]], 3
	; CHECK-NEXT: [[TMP6:%.*]] = extractelement <4 x float> [[TMP5]], i32 0			; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP6]]
	; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x float> [[TMP5]], i32 1			; CHECK-NEXT: [[TMP7:%.*]] = load float, float* [[ARRAYIDX12]], align 4
	; CHECK-NEXT: [[TMP8:%.*]] = extractelement <4 x float> [[TMP5]], i32 2
	; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x float> [[TMP5]], i32 3
	; CHECK-NEXT: br i1 [[CMP1495]], label [[FOR_COND_CLEANUP15]], label [[FOR_BODY16_LR_PH:%.*]]			; CHECK-NEXT: br i1 [[CMP1495]], label [[FOR_COND_CLEANUP15]], label [[FOR_BODY16_LR_PH:%.*]]
	; CHECK: for.body16.lr.ph:			; CHECK: for.body16.lr.ph:
	; CHECK-NEXT: [[ADD_PTR:%.*]] = getelementptr inbounds float, float* [[ARG_A:%.*]], i64 [[INDVARS_IV]]			; CHECK-NEXT: [[ADD_PTR:%.*]] = getelementptr inbounds float, float* [[ARG_A:%.*]], i64 [[INDVARS_IV]]
	; CHECK-NEXT: [[TMP10:%.*]] = load float, float* [[ADD_PTR]], align 4			; CHECK-NEXT: [[TMP8:%.*]] = load float, float* [[ADD_PTR]], align 4
	; CHECK-NEXT: br label [[FOR_BODY16:%.*]]			; CHECK-NEXT: br label [[FOR_BODY16:%.*]]
	; CHECK: for.cond.cleanup15:			; CHECK: for.cond.cleanup15:
	; CHECK-NEXT: [[W2_0_LCSSA:%.*]] = phi float [ [[TMP8]], [[FOR_BODY]] ], [ [[TMP21:%.*]], [[FOR_BODY16]] ]			; CHECK-NEXT: [[W2_0_LCSSA:%.*]] = phi float [ [[TMP5]], [[FOR_BODY]] ], [ [[SUB28:%.*]], [[FOR_BODY16]] ]
	; CHECK-NEXT: [[W3_0_LCSSA:%.*]] = phi float [ [[TMP9]], [[FOR_BODY]] ], [ [[TMP25:%.*]], [[FOR_BODY16]] ]			; CHECK-NEXT: [[W3_0_LCSSA:%.*]] = phi float [ [[TMP7]], [[FOR_BODY]] ], [ [[W2_096:%.*]], [[FOR_BODY16]] ]
	; CHECK-NEXT: [[W1_0_LCSSA:%.*]] = phi float [ [[TMP7]], [[FOR_BODY]] ], [ [[TMP12:%.*]], [[FOR_BODY16]] ]			; CHECK-NEXT: [[W1_0_LCSSA:%.*]] = phi float [ [[TMP3]], [[FOR_BODY]] ], [ [[W0_0100:%.*]], [[FOR_BODY16]] ]
	; CHECK-NEXT: [[W0_0_LCSSA:%.*]] = phi float [ [[TMP6]], [[FOR_BODY]] ], [ [[SUB19:%.*]], [[FOR_BODY16]] ]			; CHECK-NEXT: [[W0_0_LCSSA:%.*]] = phi float [ [[TMP1]], [[FOR_BODY]] ], [ [[SUB19:%.*]], [[FOR_BODY16]] ]
	; CHECK-NEXT: store float [[W0_0_LCSSA]], float* [[ARRAYIDX]], align 4			; CHECK-NEXT: store float [[W0_0_LCSSA]], float* [[ARRAYIDX]], align 4
	; CHECK-NEXT: store float [[W1_0_LCSSA]], float* [[ARRAYIDX4]], align 4			; CHECK-NEXT: store float [[W1_0_LCSSA]], float* [[ARRAYIDX4]], align 4
	; CHECK-NEXT: store float [[W2_0_LCSSA]], float* [[ARRAYIDX8]], align 4			; CHECK-NEXT: store float [[W2_0_LCSSA]], float* [[ARRAYIDX8]], align 4
	; CHECK-NEXT: store float [[W3_0_LCSSA]], float* [[ARRAYIDX12]], align 4			; CHECK-NEXT: store float [[W3_0_LCSSA]], float* [[ARRAYIDX12]], align 4
	; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1			; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; CHECK-NEXT: [[EXITCOND109:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 6			; CHECK-NEXT: [[EXITCOND109:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 6
	; CHECK-NEXT: br i1 [[EXITCOND109]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]			; CHECK-NEXT: br i1 [[EXITCOND109]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
	; CHECK: for.body16:			; CHECK: for.body16:
				; CHECK-NEXT: [[W0_0100]] = phi float [ [[TMP1]], [[FOR_BODY16_LR_PH]] ], [ [[SUB19]], [[FOR_BODY16]] ]
				; CHECK-NEXT: [[W1_099:%.*]] = phi float [ [[TMP3]], [[FOR_BODY16_LR_PH]] ], [ [[W0_0100]], [[FOR_BODY16]] ]
	; CHECK-NEXT: [[J_098:%.*]] = phi i32 [ 0, [[FOR_BODY16_LR_PH]] ], [ [[INC:%.*]], [[FOR_BODY16]] ]			; CHECK-NEXT: [[J_098:%.*]] = phi i32 [ 0, [[FOR_BODY16_LR_PH]] ], [ [[INC:%.*]], [[FOR_BODY16]] ]
	; CHECK-NEXT: [[TMP11:%.*]] = phi <4 x float> [ [[TMP5]], [[FOR_BODY16_LR_PH]] ], [ [[TMP26:%.*]], [[FOR_BODY16]] ]			; CHECK-NEXT: [[W3_097:%.*]] = phi float [ [[TMP7]], [[FOR_BODY16_LR_PH]] ], [ [[W2_096]], [[FOR_BODY16]] ]
	; CHECK-NEXT: [[TMP12]] = extractelement <4 x float> [[TMP11]], i32 0			; CHECK-NEXT: [[W2_096]] = phi float [ [[TMP5]], [[FOR_BODY16_LR_PH]] ], [ [[SUB28]], [[FOR_BODY16]] ]
	; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x float> [[TMP11]], i32 1			; CHECK-NEXT: [[MUL17:%.*]] = fmul fast float [[W0_0100]], 0x3FF19999A0000000
	; CHECK-NEXT: [[TMP14:%.*]] = insertelement <2 x float> undef, float [[TMP12]], i32 0			; CHECK-NEXT: [[MUL18_NEG:%.*]] = fmul fast float [[W1_099]], 0xBFF3333340000000
	; CHECK-NEXT: [[TMP15:%.*]] = insertelement <2 x float> [[TMP14]], float [[TMP13]], i32 1			; CHECK-NEXT: [[SUB92:%.*]] = fadd fast float [[MUL17]], [[MUL18_NEG]]
	; CHECK-NEXT: [[TMP16:%.*]] = fmul fast <2 x float> <float 0x3FF19999A0000000, float 0xBFF3333340000000>, [[TMP15]]			; CHECK-NEXT: [[SUB19]] = fadd fast float [[SUB92]], [[TMP8]]
	; CHECK-NEXT: [[TMP17:%.*]] = extractelement <2 x float> [[TMP16]], i32 0
	; CHECK-NEXT: [[TMP18:%.*]] = extractelement <2 x float> [[TMP16]], i32 1
	; CHECK-NEXT: [[SUB92:%.*]] = fadd fast float [[TMP17]], [[TMP18]]
	; CHECK-NEXT: [[SUB19]] = fadd fast float [[SUB92]], [[TMP10]]
	; CHECK-NEXT: [[MUL20:%.*]] = fmul fast float [[SUB19]], 0x4000CCCCC0000000			; CHECK-NEXT: [[MUL20:%.*]] = fmul fast float [[SUB19]], 0x4000CCCCC0000000
	; CHECK-NEXT: [[TMP19:%.*]] = fmul fast <4 x float> <float 0xC0019999A0000000, float 0x4002666660000000, float 0x4008CCCCC0000000, float 0xC0099999A0000000>, [[TMP11]]			; CHECK-NEXT: [[MUL21_NEG:%.*]] = fmul fast float [[W0_0100]], 0xC0019999A0000000
	; CHECK-NEXT: [[ADD2293:%.*]] = fadd fast float undef, undef			; CHECK-NEXT: [[MUL23:%.*]] = fmul fast float [[W1_099]], 0x4002666660000000
	; CHECK-NEXT: [[ADD24:%.*]] = fadd fast float [[ADD2293]], undef			; CHECK-NEXT: [[MUL25:%.*]] = fmul fast float [[W2_096]], 0x4008CCCCC0000000
	; CHECK-NEXT: [[SUB2694:%.*]] = fadd fast float [[ADD24]], undef			; CHECK-NEXT: [[MUL27_NEG:%.*]] = fmul fast float [[W3_097]], 0xC0099999A0000000
	; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <4 x float> [[TMP19]], <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>			; CHECK-NEXT: [[ADD2293:%.*]] = fadd fast float [[MUL27_NEG]], [[MUL25]]
	; CHECK-NEXT: [[BIN_RDX:%.*]] = fadd fast <4 x float> [[TMP19]], [[RDX_SHUF]]			; CHECK-NEXT: [[ADD24:%.*]] = fadd fast float [[ADD2293]], [[MUL23]]
	; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <4 x float> [[BIN_RDX]], <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[SUB2694:%.*]] = fadd fast float [[ADD24]], [[MUL21_NEG]]
	; CHECK-NEXT: [[BIN_RDX2:%.*]] = fadd fast <4 x float> [[BIN_RDX]], [[RDX_SHUF1]]			; CHECK-NEXT: [[SUB28]] = fadd fast float [[SUB2694]], [[MUL20]]
	; CHECK-NEXT: [[TMP20:%.*]] = extractelement <4 x float> [[BIN_RDX2]], i32 0
	; CHECK-NEXT: [[TMP21]] = fadd fast float [[TMP20]], [[MUL20]]
	; CHECK-NEXT: [[SUB28:%.*]] = fadd fast float [[SUB2694]], [[MUL20]]
	; CHECK-NEXT: [[INC]] = add nuw i32 [[J_098]], 1			; CHECK-NEXT: [[INC]] = add nuw i32 [[J_098]], 1
	; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[ARG_B]]			; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[ARG_B]]
	; CHECK-NEXT: [[TMP22:%.*]] = insertelement <4 x float> undef, float [[SUB19]], i32 0
	; CHECK-NEXT: [[TMP23:%.*]] = insertelement <4 x float> [[TMP22]], float [[TMP12]], i32 1
	; CHECK-NEXT: [[TMP24:%.*]] = insertelement <4 x float> [[TMP23]], float [[TMP21]], i32 2
	; CHECK-NEXT: [[TMP25]] = extractelement <4 x float> [[TMP11]], i32 2
	; CHECK-NEXT: [[TMP26]] = insertelement <4 x float> [[TMP24]], float [[TMP25]], i32 3
	; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP15]], label [[FOR_BODY16]]			; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP15]], label [[FOR_BODY16]]
	;			;
	; STORE-LABEL: @foo(			; STORE-LABEL: @foo(
	; STORE-NEXT: entry:			; STORE-NEXT: entry:
	; STORE-NEXT: [[CMP1495:%.*]] = icmp eq i32 [[ARG_B:%.*]], 0			; STORE-NEXT: [[CMP1495:%.*]] = icmp eq i32 [[ARG_B:%.*]], 0
	; STORE-NEXT: br label [[FOR_BODY:%.*]]			; STORE-NEXT: br label [[FOR_BODY:%.*]]
	; STORE: for.cond.cleanup:			; STORE: for.cond.cleanup:
	; STORE-NEXT: ret void			; STORE-NEXT: ret void
	; STORE: for.body:			; STORE: for.body:
	; STORE-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_COND_CLEANUP15:%.*]] ]			; STORE-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_COND_CLEANUP15:%.*]] ]
	; STORE-NEXT: [[TMP0:%.*]] = shl i64 [[INDVARS_IV]], 2			; STORE-NEXT: [[TMP0:%.*]] = shl i64 [[INDVARS_IV]], 2
	; STORE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[ARRAY:%.*]], i64 [[TMP0]]			; STORE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[ARRAY:%.*]], i64 [[TMP0]]
	; STORE-NEXT: [[TMP1:%.*]] = or i64 [[TMP0]], 1			; STORE-NEXT: [[TMP1:%.*]] = load float, float* [[ARRAYIDX]], align 4
	; STORE-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP1]]			; STORE-NEXT: [[TMP2:%.*]] = or i64 [[TMP0]], 1
	; STORE-NEXT: [[TMP2:%.*]] = or i64 [[TMP0]], 2			; STORE-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP2]]
	; STORE-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP2]]			; STORE-NEXT: [[TMP3:%.*]] = load float, float* [[ARRAYIDX4]], align 4
	; STORE-NEXT: [[TMP3:%.*]] = or i64 [[TMP0]], 3			; STORE-NEXT: [[TMP4:%.*]] = or i64 [[TMP0]], 2
	; STORE-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP3]]			; STORE-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP4]]
	; STORE-NEXT: [[TMP4:%.*]] = bitcast float* [[ARRAYIDX]] to <4 x float>*			; STORE-NEXT: [[TMP5:%.*]] = load float, float* [[ARRAYIDX8]], align 4
	; STORE-NEXT: [[TMP5:%.*]] = load <4 x float>, <4 x float>* [[TMP4]], align 4			; STORE-NEXT: [[TMP6:%.*]] = or i64 [[TMP0]], 3
	; STORE-NEXT: [[TMP6:%.*]] = extractelement <4 x float> [[TMP5]], i32 0			; STORE-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP6]]
	; STORE-NEXT: [[TMP7:%.*]] = extractelement <4 x float> [[TMP5]], i32 1			; STORE-NEXT: [[TMP7:%.*]] = load float, float* [[ARRAYIDX12]], align 4
	; STORE-NEXT: [[TMP8:%.*]] = extractelement <4 x float> [[TMP5]], i32 2
	; STORE-NEXT: [[TMP9:%.*]] = extractelement <4 x float> [[TMP5]], i32 3
	; STORE-NEXT: br i1 [[CMP1495]], label [[FOR_COND_CLEANUP15]], label [[FOR_BODY16_LR_PH:%.*]]			; STORE-NEXT: br i1 [[CMP1495]], label [[FOR_COND_CLEANUP15]], label [[FOR_BODY16_LR_PH:%.*]]
	; STORE: for.body16.lr.ph:			; STORE: for.body16.lr.ph:
	; STORE-NEXT: [[ADD_PTR:%.*]] = getelementptr inbounds float, float* [[ARG_A:%.*]], i64 [[INDVARS_IV]]			; STORE-NEXT: [[ADD_PTR:%.*]] = getelementptr inbounds float, float* [[ARG_A:%.*]], i64 [[INDVARS_IV]]
	; STORE-NEXT: [[TMP10:%.*]] = load float, float* [[ADD_PTR]], align 4			; STORE-NEXT: [[TMP8:%.*]] = load float, float* [[ADD_PTR]], align 4
	; STORE-NEXT: br label [[FOR_BODY16:%.*]]			; STORE-NEXT: br label [[FOR_BODY16:%.*]]
	; STORE: for.cond.cleanup15:			; STORE: for.cond.cleanup15:
	; STORE-NEXT: [[W2_0_LCSSA:%.*]] = phi float [ [[TMP8]], [[FOR_BODY]] ], [ [[TMP21:%.*]], [[FOR_BODY16]] ]			; STORE-NEXT: [[W2_0_LCSSA:%.*]] = phi float [ [[TMP5]], [[FOR_BODY]] ], [ [[SUB28:%.*]], [[FOR_BODY16]] ]
	; STORE-NEXT: [[W3_0_LCSSA:%.*]] = phi float [ [[TMP9]], [[FOR_BODY]] ], [ [[TMP25:%.*]], [[FOR_BODY16]] ]			; STORE-NEXT: [[W3_0_LCSSA:%.*]] = phi float [ [[TMP7]], [[FOR_BODY]] ], [ [[W2_096:%.*]], [[FOR_BODY16]] ]
	; STORE-NEXT: [[W1_0_LCSSA:%.*]] = phi float [ [[TMP7]], [[FOR_BODY]] ], [ [[TMP12:%.*]], [[FOR_BODY16]] ]			; STORE-NEXT: [[W1_0_LCSSA:%.*]] = phi float [ [[TMP3]], [[FOR_BODY]] ], [ [[W0_0100:%.*]], [[FOR_BODY16]] ]
	; STORE-NEXT: [[W0_0_LCSSA:%.*]] = phi float [ [[TMP6]], [[FOR_BODY]] ], [ [[SUB19:%.*]], [[FOR_BODY16]] ]			; STORE-NEXT: [[W0_0_LCSSA:%.*]] = phi float [ [[TMP1]], [[FOR_BODY]] ], [ [[SUB19:%.*]], [[FOR_BODY16]] ]
	; STORE-NEXT: store float [[W0_0_LCSSA]], float* [[ARRAYIDX]], align 4			; STORE-NEXT: store float [[W0_0_LCSSA]], float* [[ARRAYIDX]], align 4
	; STORE-NEXT: store float [[W1_0_LCSSA]], float* [[ARRAYIDX4]], align 4			; STORE-NEXT: store float [[W1_0_LCSSA]], float* [[ARRAYIDX4]], align 4
	; STORE-NEXT: store float [[W2_0_LCSSA]], float* [[ARRAYIDX8]], align 4			; STORE-NEXT: store float [[W2_0_LCSSA]], float* [[ARRAYIDX8]], align 4
	; STORE-NEXT: store float [[W3_0_LCSSA]], float* [[ARRAYIDX12]], align 4			; STORE-NEXT: store float [[W3_0_LCSSA]], float* [[ARRAYIDX12]], align 4
	; STORE-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1			; STORE-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; STORE-NEXT: [[EXITCOND109:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 6			; STORE-NEXT: [[EXITCOND109:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 6
	; STORE-NEXT: br i1 [[EXITCOND109]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]			; STORE-NEXT: br i1 [[EXITCOND109]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
	; STORE: for.body16:			; STORE: for.body16:
				; STORE-NEXT: [[W0_0100]] = phi float [ [[TMP1]], [[FOR_BODY16_LR_PH]] ], [ [[SUB19]], [[FOR_BODY16]] ]
				; STORE-NEXT: [[W1_099:%.*]] = phi float [ [[TMP3]], [[FOR_BODY16_LR_PH]] ], [ [[W0_0100]], [[FOR_BODY16]] ]
	; STORE-NEXT: [[J_098:%.*]] = phi i32 [ 0, [[FOR_BODY16_LR_PH]] ], [ [[INC:%.*]], [[FOR_BODY16]] ]			; STORE-NEXT: [[J_098:%.*]] = phi i32 [ 0, [[FOR_BODY16_LR_PH]] ], [ [[INC:%.*]], [[FOR_BODY16]] ]
	; STORE-NEXT: [[TMP11:%.*]] = phi <4 x float> [ [[TMP5]], [[FOR_BODY16_LR_PH]] ], [ [[TMP26:%.*]], [[FOR_BODY16]] ]			; STORE-NEXT: [[W3_097:%.*]] = phi float [ [[TMP7]], [[FOR_BODY16_LR_PH]] ], [ [[W2_096]], [[FOR_BODY16]] ]
	; STORE-NEXT: [[TMP12]] = extractelement <4 x float> [[TMP11]], i32 0			; STORE-NEXT: [[W2_096]] = phi float [ [[TMP5]], [[FOR_BODY16_LR_PH]] ], [ [[SUB28]], [[FOR_BODY16]] ]
	; STORE-NEXT: [[TMP13:%.*]] = extractelement <4 x float> [[TMP11]], i32 1			; STORE-NEXT: [[MUL17:%.*]] = fmul fast float [[W0_0100]], 0x3FF19999A0000000
	; STORE-NEXT: [[TMP14:%.*]] = insertelement <2 x float> undef, float [[TMP12]], i32 0			; STORE-NEXT: [[MUL18_NEG:%.*]] = fmul fast float [[W1_099]], 0xBFF3333340000000
	; STORE-NEXT: [[TMP15:%.*]] = insertelement <2 x float> [[TMP14]], float [[TMP13]], i32 1			; STORE-NEXT: [[SUB92:%.*]] = fadd fast float [[MUL17]], [[MUL18_NEG]]
	; STORE-NEXT: [[TMP16:%.*]] = fmul fast <2 x float> <float 0x3FF19999A0000000, float 0xBFF3333340000000>, [[TMP15]]			; STORE-NEXT: [[SUB19]] = fadd fast float [[SUB92]], [[TMP8]]
	; STORE-NEXT: [[TMP17:%.*]] = extractelement <2 x float> [[TMP16]], i32 0
	; STORE-NEXT: [[TMP18:%.*]] = extractelement <2 x float> [[TMP16]], i32 1
	; STORE-NEXT: [[SUB92:%.*]] = fadd fast float [[TMP17]], [[TMP18]]
	; STORE-NEXT: [[SUB19]] = fadd fast float [[SUB92]], [[TMP10]]
	; STORE-NEXT: [[MUL20:%.*]] = fmul fast float [[SUB19]], 0x4000CCCCC0000000			; STORE-NEXT: [[MUL20:%.*]] = fmul fast float [[SUB19]], 0x4000CCCCC0000000
	; STORE-NEXT: [[TMP19:%.*]] = fmul fast <4 x float> <float 0xC0019999A0000000, float 0x4002666660000000, float 0x4008CCCCC0000000, float 0xC0099999A0000000>, [[TMP11]]			; STORE-NEXT: [[MUL21_NEG:%.*]] = fmul fast float [[W0_0100]], 0xC0019999A0000000
	; STORE-NEXT: [[ADD2293:%.*]] = fadd fast float undef, undef			; STORE-NEXT: [[MUL23:%.*]] = fmul fast float [[W1_099]], 0x4002666660000000
	; STORE-NEXT: [[ADD24:%.*]] = fadd fast float [[ADD2293]], undef			; STORE-NEXT: [[MUL25:%.*]] = fmul fast float [[W2_096]], 0x4008CCCCC0000000
	; STORE-NEXT: [[SUB2694:%.*]] = fadd fast float [[ADD24]], undef			; STORE-NEXT: [[MUL27_NEG:%.*]] = fmul fast float [[W3_097]], 0xC0099999A0000000
	; STORE-NEXT: [[RDX_SHUF:%.*]] = shufflevector <4 x float> [[TMP19]], <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>			; STORE-NEXT: [[ADD2293:%.*]] = fadd fast float [[MUL27_NEG]], [[MUL25]]
	; STORE-NEXT: [[BIN_RDX:%.*]] = fadd fast <4 x float> [[TMP19]], [[RDX_SHUF]]			; STORE-NEXT: [[ADD24:%.*]] = fadd fast float [[ADD2293]], [[MUL23]]
	; STORE-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <4 x float> [[BIN_RDX]], <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>			; STORE-NEXT: [[SUB2694:%.*]] = fadd fast float [[ADD24]], [[MUL21_NEG]]
	; STORE-NEXT: [[BIN_RDX2:%.*]] = fadd fast <4 x float> [[BIN_RDX]], [[RDX_SHUF1]]			; STORE-NEXT: [[SUB28]] = fadd fast float [[SUB2694]], [[MUL20]]
	; STORE-NEXT: [[TMP20:%.*]] = extractelement <4 x float> [[BIN_RDX2]], i32 0
	; STORE-NEXT: [[TMP21]] = fadd fast float [[TMP20]], [[MUL20]]
	; STORE-NEXT: [[SUB28:%.*]] = fadd fast float [[SUB2694]], [[MUL20]]
	; STORE-NEXT: [[INC]] = add nuw i32 [[J_098]], 1			; STORE-NEXT: [[INC]] = add nuw i32 [[J_098]], 1
	; STORE-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[ARG_B]]			; STORE-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[ARG_B]]
	; STORE-NEXT: [[TMP22:%.*]] = insertelement <4 x float> undef, float [[SUB19]], i32 0
	; STORE-NEXT: [[TMP23:%.*]] = insertelement <4 x float> [[TMP22]], float [[TMP12]], i32 1
	; STORE-NEXT: [[TMP24:%.*]] = insertelement <4 x float> [[TMP23]], float [[TMP21]], i32 2
	; STORE-NEXT: [[TMP25]] = extractelement <4 x float> [[TMP11]], i32 2
	; STORE-NEXT: [[TMP26]] = insertelement <4 x float> [[TMP24]], float [[TMP25]], i32 3
	; STORE-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP15]], label [[FOR_BODY16]]			; STORE-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP15]], label [[FOR_BODY16]]
	;			;
	entry:			entry:
	%cmp1495 = icmp eq i32 %arg_B, 0			%cmp1495 = icmp eq i32 %arg_B, 0
	br label %for.body			br label %for.body

	for.cond.cleanup: ; preds = %for.cond.cleanup15			for.cond.cleanup: ; preds = %for.cond.cleanup15
	ret void			ret void

test/Transforms/SLPVectorizer/X86/reorder_phi.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -basicaa -slp-vectorizer -S -mtriple=x86_64-unknown -mcpu=corei7-avx | FileCheck %s			; RUN: opt < %s -basicaa -slp-vectorizer -S -mtriple=x86_64-unknown -mcpu=corei7-avx | FileCheck %s

	%struct.complex = type { float, float }			%struct.complex = type { float, float }

	define void @foo (%struct.complex* %A, %struct.complex* %B, %struct.complex* %Result) {			define void @foo (%struct.complex* %A, %struct.complex* %B, %struct.complex* %Result) {
	; CHECK-LABEL: @foo(			; CHECK-LABEL: @foo(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = add i64 256, 0			; CHECK-NEXT: [[TMP0:%.*]] = add i64 256, 0
	; CHECK-NEXT: br label [[LOOP:%.*]]			; CHECK-NEXT: br label [[LOOP:%.*]]
	; CHECK: loop:			; CHECK: loop:
	; CHECK-NEXT: [[TMP1:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[TMP25:%.*]], [[LOOP]] ]			; CHECK-NEXT: [[TMP1:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[TMP20:%.*]], [[LOOP]] ]
	; CHECK-NEXT: [[TMP2:%.*]] = phi <2 x float> [ zeroinitializer, [[ENTRY]] ], [ [[TMP24:%.*]], [[LOOP]] ]			; CHECK-NEXT: [[TMP2:%.*]] = phi float [ 0.000000e+00, [[ENTRY]] ], [ [[TMP19:%.*]], [[LOOP]] ]
	; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX:%.*]], %struct.complex* [[A:%.*]], i64 [[TMP1]], i32 0			; CHECK-NEXT: [[TMP3:%.*]] = phi float [ 0.000000e+00, [[ENTRY]] ], [ [[TMP18:%.*]], [[LOOP]] ]
	; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[A]], i64 [[TMP1]], i32 1			; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX:%.*]], %struct.complex* [[A:%.*]], i64 [[TMP1]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = bitcast float* [[TMP3]] to <2 x float>*			; CHECK-NEXT: [[TMP5:%.*]] = load float, float* [[TMP4]], align 4
	; CHECK-NEXT: [[TMP6:%.*]] = load <2 x float>, <2 x float>* [[TMP5]], align 4			; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[A]], i64 [[TMP1]], i32 1
	; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[B:%.*]], i64 [[TMP1]], i32 0			; CHECK-NEXT: [[TMP7:%.*]] = load float, float* [[TMP6]], align 4
	; CHECK-NEXT: [[TMP8:%.*]] = load float, float* [[TMP7]], align 4			; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[B:%.*]], i64 [[TMP1]], i32 0
	; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[B]], i64 [[TMP1]], i32 1			; CHECK-NEXT: [[TMP9:%.*]] = load float, float* [[TMP8]], align 4
	; CHECK-NEXT: [[TMP10:%.*]] = load float, float* [[TMP9]], align 4			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[B]], i64 [[TMP1]], i32 1
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <2 x float> undef, float [[TMP8]], i32 0			; CHECK-NEXT: [[TMP11:%.*]] = load float, float* [[TMP10]], align 4
	; CHECK-NEXT: [[TMP12:%.*]] = insertelement <2 x float> [[TMP11]], float [[TMP8]], i32 1			; CHECK-NEXT: [[TMP12:%.*]] = fmul float [[TMP5]], [[TMP9]]
	; CHECK-NEXT: [[TMP13:%.*]] = fmul <2 x float> [[TMP6]], [[TMP12]]			; CHECK-NEXT: [[TMP13:%.*]] = fmul float [[TMP7]], [[TMP11]]
	; CHECK-NEXT: [[TMP14:%.*]] = extractelement <2 x float> [[TMP6]], i32 1			; CHECK-NEXT: [[TMP14:%.*]] = fsub float [[TMP12]], [[TMP13]]
	; CHECK-NEXT: [[TMP15:%.*]] = insertelement <2 x float> undef, float [[TMP14]], i32 0			; CHECK-NEXT: [[TMP15:%.*]] = fmul float [[TMP7]], [[TMP9]]
	; CHECK-NEXT: [[TMP16:%.*]] = extractelement <2 x float> [[TMP6]], i32 0			; CHECK-NEXT: [[TMP16:%.*]] = fmul float [[TMP5]], [[TMP11]]
	; CHECK-NEXT: [[TMP17:%.*]] = insertelement <2 x float> [[TMP15]], float [[TMP16]], i32 1			; CHECK-NEXT: [[TMP17:%.*]] = fadd float [[TMP15]], [[TMP16]]
	; CHECK-NEXT: [[TMP18:%.*]] = insertelement <2 x float> undef, float [[TMP10]], i32 0			; CHECK-NEXT: [[TMP18]] = fadd float [[TMP3]], [[TMP14]]
	; CHECK-NEXT: [[TMP19:%.*]] = insertelement <2 x float> [[TMP18]], float [[TMP10]], i32 1			; CHECK-NEXT: [[TMP19]] = fadd float [[TMP2]], [[TMP17]]
	; CHECK-NEXT: [[TMP20:%.*]] = fmul <2 x float> [[TMP17]], [[TMP19]]			; CHECK-NEXT: [[TMP20]] = add nuw nsw i64 [[TMP1]], 1
	; CHECK-NEXT: [[TMP21:%.*]] = fsub <2 x float> [[TMP13]], [[TMP20]]			; CHECK-NEXT: [[TMP21:%.*]] = icmp eq i64 [[TMP20]], [[TMP0]]
	; CHECK-NEXT: [[TMP22:%.*]] = fadd <2 x float> [[TMP13]], [[TMP20]]			; CHECK-NEXT: br i1 [[TMP21]], label [[EXIT:%.*]], label [[LOOP]]
	; CHECK-NEXT: [[TMP23:%.*]] = shufflevector <2 x float> [[TMP21]], <2 x float> [[TMP22]], <2 x i32> <i32 0, i32 3>
	; CHECK-NEXT: [[TMP24]] = fadd <2 x float> [[TMP2]], [[TMP23]]
	; CHECK-NEXT: [[TMP25]] = add nuw nsw i64 [[TMP1]], 1
	; CHECK-NEXT: [[TMP26:%.*]] = icmp eq i64 [[TMP25]], [[TMP0]]
	; CHECK-NEXT: br i1 [[TMP26]], label [[EXIT:%.*]], label [[LOOP]]
	; CHECK: exit:			; CHECK: exit:
	; CHECK-NEXT: [[TMP27:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[RESULT:%.*]], i32 0, i32 0			; CHECK-NEXT: [[TMP22:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[RESULT:%.*]], i32 0, i32 0
	; CHECK-NEXT: [[TMP28:%.*]] = extractelement <2 x float> [[TMP24]], i32 0			; CHECK-NEXT: store float [[TMP18]], float* [[TMP22]], align 4
	; CHECK-NEXT: store float [[TMP28]], float* [[TMP27]], align 4			; CHECK-NEXT: [[TMP23:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[RESULT]], i32 0, i32 1
	; CHECK-NEXT: [[TMP29:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[RESULT]], i32 0, i32 1			; CHECK-NEXT: store float [[TMP19]], float* [[TMP23]], align 4
	; CHECK-NEXT: [[TMP30:%.*]] = extractelement <2 x float> [[TMP24]], i32 1
	; CHECK-NEXT: store float [[TMP30]], float* [[TMP29]], align 4
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%0 = add i64 256, 0			%0 = add i64 256, 0
	br label %loop			br label %loop

	loop:			loop:
	%1 = phi i64 [ 0, %entry ], [ %20, %loop ]			%1 = phi i64 [ 0, %entry ], [ %20, %loop ]

test/Transforms/SLPVectorizer/X86/simplebb.ll

;		;
ret void		ret void
}		}

; Don't vectorize volatile loads.		; Don't vectorize volatile loads.
define void @test_volatile_load(double* %a, double* %b, double* %c) {		define void @test_volatile_load(double* %a, double* %b, double* %c) {
; CHECK-LABEL: @test_volatile_load(		; CHECK-LABEL: @test_volatile_load(
; CHECK-NEXT: [[I0:%.*]] = load volatile double, double* [[A:%.*]], align 8		; CHECK-NEXT: [[I0:%.*]] = load volatile double, double* [[A:%.*]], align 8
; CHECK-NEXT: [[I1:%.*]] = load volatile double, double* [[B:%.*]], align 8		; CHECK-NEXT: [[I1:%.*]] = load volatile double, double* [[B:%.*]], align 8
		; CHECK-NEXT: [[MUL:%.*]] = fmul double [[I0]], [[I1]]
; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds double, double* [[A]], i64 1		; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds double, double* [[A]], i64 1
; CHECK-NEXT: [[I3:%.*]] = load double, double* [[ARRAYIDX3]], align 8		; CHECK-NEXT: [[I3:%.*]] = load double, double* [[ARRAYIDX3]], align 8
; CHECK-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds double, double* [[B]], i64 1		; CHECK-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds double, double* [[B]], i64 1
; CHECK-NEXT: [[I4:%.*]] = load double, double* [[ARRAYIDX4]], align 8		; CHECK-NEXT: [[I4:%.*]] = load double, double* [[ARRAYIDX4]], align 8
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> undef, double [[I0]], i32 0		; CHECK-NEXT: [[MUL5:%.*]] = fmul double [[I3]], [[I4]]
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> [[TMP1]], double [[I3]], i32 1		; CHECK-NEXT: store double [[MUL]], double* [[C:%.*]], align 8
; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> undef, double [[I1]], i32 0		; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds double, double* [[C]], i64 1
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> [[TMP3]], double [[I4]], i32 1		; CHECK-NEXT: store double [[MUL5]], double* [[ARRAYIDX5]], align 8
; CHECK-NEXT: [[TMP5:%.*]] = fmul <2 x double> [[TMP2]], [[TMP4]]
; CHECK-NEXT: [[TMP6:%.*]] = bitcast double* [[C:%.*]] to <2 x double>*
; CHECK-NEXT: store <2 x double> [[TMP5]], <2 x double>* [[TMP6]], align 8
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%i0 = load volatile double, double* %a, align 8		%i0 = load volatile double, double* %a, align 8
%i1 = load volatile double, double* %b, align 8		%i1 = load volatile double, double* %b, align 8
%mul = fmul double %i0, %i1		%mul = fmul double %i0, %i1
%arrayidx3 = getelementptr inbounds double, double* %a, i64 1		%arrayidx3 = getelementptr inbounds double, double* %a, i64 1
%i3 = load double, double* %arrayidx3, align 8		%i3 = load double, double* %arrayidx3, align 8
%arrayidx4 = getelementptr inbounds double, double* %b, i64 1		%arrayidx4 = getelementptr inbounds double, double* %b, i64 1
