This is an archive of the discontinued LLVM Phabricator instance.

[TTI CostModel] change default cost of FP ops to 1 (PR36280)
Abandoned, Public

Authored by spatel on Feb 8 2018, 9:09 AM.

Details

Reviewers

hfinkel
ABataev
efriedma
fhahn
RKSimon
craig.topper
javed.absar

Commits

rG3e8a76abfda5: [TTI CostModel] change default cost of FP ops to 1 (PR36280)
rL325515: [TTI CostModel] change default cost of FP ops to 1 (PR36280)

Summary

This change was mentioned at least as far back as:
https://bugs.llvm.org/show_bug.cgi?id=26837#c26
...and I found a real program that directly shows the harm. Himeno running on AMD Jaguar gets 6% slower with SLP vectorization:
https://bugs.llvm.org/show_bug.cgi?id=36280

I don't know the history here. Maybe this was set in the Pentium 4 days, or there's just confusion about which cost we're modelling.

I've added a comment to make it clear that this is the throughput cost of a math instruction.

The div/rem costs for x86 look very wrong in some cases, but I think that's already true, so we can fix those in follow-up patches. There's also evidence that more cost model changes are needed to solve SLP problems as shown in D42981, but I think that's an independent problem (though the solution may be adjusted assuming this change is approved).

Diff Detail

Repository: rL LLVM

Event Timeline

spatel created this revision. Feb 8 2018, 9:09 AM

Herald added subscribers: kristof.beyls, javed.absar, mcrosier, aemerson. Feb 8 2018, 9:09 AM

spatel retitled this revision from "[TTI CostModel] change default cost of FP ops to 1 (PR" to "[TTI CostModel] change default cost of FP ops to 1 (PR36280)". Feb 8 2018, 9:09 AM

ABataev added inline comments. Feb 8 2018, 9:28 AM

include/llvm/CodeGen/BasicTTIImpl.h
494 ↗	(On Diff #133434)	Maybe it is better to add a virtual method that will return the default cost of the integer and floating point operations for the target? The default implementation should keep 1 and 2, but for X86 it should return 1 in both cases.
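A minimal sketch of what such a per-target hook could look like. The class and method names here (`GenericCostModel`, `X86LikeCostModel`, `defaultIntOpCost`, `defaultFPOpCost`) are hypothetical and are not LLVM's TTI interfaces; plain virtual dispatch is used only to keep the example short.

```cpp
#include <cassert>

// Hypothetical, simplified cost model; not LLVM's real TTI classes.
struct GenericCostModel {
  virtual ~GenericCostModel() = default;

  // Default relative costs as they stood before this patch: int 1, FP 2.
  virtual unsigned defaultIntOpCost() const { return 1; }
  virtual unsigned defaultFPOpCost() const { return 2; }

  // Cost of a legal arithmetic op that legalizes into LegalizedParts pieces.
  unsigned arithmeticCost(bool IsFloat, unsigned LegalizedParts) const {
    assert(LegalizedParts > 0 && "expected at least one legal piece");
    unsigned OpCost = IsFloat ? defaultFPOpCost() : defaultIntOpCost();
    return LegalizedParts * OpCost;
  }
};

// Hypothetical target that reports the same default cost for int and FP ops.
struct X86LikeCostModel : GenericCostModel {
  unsigned defaultFPOpCost() const override { return 1; }
};
```

With this shape, targets that never override the hook keep the 1 and 2 defaults, while a target such as the x86-like one opts in to equal costs.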

spatel added inline comments. Feb 8 2018, 9:49 AM

include/llvm/CodeGen/BasicTTIImpl.h
494 ↗	(On Diff #133434)	I don't know of any target where anything but '1' is the right default answer, so I'd rather not perpetuate this. I think it's better to correct the problems revealed by this change.

hfinkel added inline comments. Feb 8 2018, 12:01 PM

include/llvm/CodeGen/BasicTTIImpl.h
494 ↗	(On Diff #133434)	I agree, but recall that these are measuring reciprocal throughputs, and it's not unreasonable to assume that floating-point ops will have lower throughputs than integer operations in some general sense. Nevertheless, '2' seems somewhat arbitrary in this regard.

spatel added inline comments. Feb 8 2018, 1:28 PM

include/llvm/CodeGen/BasicTTIImpl.h
494 ↗	(On Diff #133434)	Ah, the code comment isn't accurate then. And since we're using integers, this is really a relative reciprocal throughput (can't go under 1) rather than an actual cycle-based value?

hfinkel added inline comments. Feb 8 2018, 1:37 PM

include/llvm/CodeGen/BasicTTIImpl.h
494 ↗	(On Diff #133434)	Not sure what you mean by "cycle based." All cost model results are relative, but this cost model used by the vectorizer is supposed to return (relative) reciprocal throughputs. We now have a separate cost model for latency (and a separate "user cost" model, which essentially models micro-op count).

spatel added inline comments. Feb 8 2018, 1:50 PM

include/llvm/CodeGen/BasicTTIImpl.h
494 ↗	(On Diff #133434)	In the backend, the scheduler model shows instruction timing like this (in test/CodeGen/X86/sse2-schedule.ll):

; SKYLAKE-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [4:0.50]

The "0.50" means we can execute 2 of these per cycle because there are 2 units that process these instructions. We're integer 1-based here, so if we want to show that a target has 3 integer adders and 1 FP adder, we would make FP ops cost 3 rather than 1 (instead of the 0.33 for integer adds that we would expect in the scheduler model).
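The arithmetic in that last comment can be sketched as follows, assuming the integer cost is obtained by dividing an op's reciprocal throughput by that of the fastest op and rounding. `relativeCost` is a hypothetical helper written for illustration, not an LLVM function.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: turn a scheduler-style reciprocal throughput (cycles
// per instruction) into an integer cost relative to the fastest operation.
unsigned relativeCost(double RecipThroughput, double FastestRecipThroughput) {
  assert(RecipThroughput > 0.0 && FastestRecipThroughput > 0.0);
  // Round to the nearest integer ratio, then clamp to 1, the smallest cost
  // an integer-based model can express.
  long Ratio = std::lround(RecipThroughput / FastestRecipThroughput);
  return static_cast<unsigned>(std::max(1L, Ratio));
}
```

For a target with 3 integer adders and 1 FP adder, the reciprocal throughputs are 1/3 and 1, so `relativeCost(1.0 / 3.0, 1.0 / 3.0)` gives 1 for the integer add and `relativeCost(1.0, 1.0 / 3.0)` gives 3 for the FP add.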

Wouldn't we be better off avoiding affecting generic targets and just adding tuned FADD/FSUB/FMUL costs to getArithmeticInstrCost in X86TargetTransformInfo.cpp? The general rule we used for FDIV/FSQRT was to use the 'worst hardware' cost for a given SSE level - so AVX1 (SB/JG) might have a better cost than SSE42 (P4) etc. which seemed to work pretty well.

If we go that route we might want to add extra cpu targets to some of the x86 slp tests.

test/Transforms/SLPVectorizer/X86/PR36280.ll
4 ↗	(On Diff #133434)	It'd probably be useful to include a comment that this code snippet is from the himeno benchmark?

Patch updated:
No functional changes from the last rev, but tried to make the code comment accurate and added a comment to the PR36280 test to explain the motivation.

In D43079#1003118, @RKSimon wrote:

Wouldn't we be better off avoiding affecting generic targets and just adding tuned FADD/FSUB/FMUL costs to getArithmeticInstrCost in X86TargetTransformInfo.cpp?

I'm not opposed to making that change (and it seems clear that we need to make other x86 cost model changes), but I think this one is independent of that. This is just supposed to make the default cost less likely to require an override.

The fact that it helps the motivating (PR36280) case that led me here appears to be coincidence. I'm not understanding something fundamental about how SLP uses this cost to decide profitability, but I'll try to sort that out in D42981 or later.

Ping.

@fhahn Do you have any ARM/AArch64 concerns? Otherwise, I'm happy with this change.

I think this change should be fine for modern AArch64 chips. For example, Cortex-A72 has 2 integer and 2 fp units. Other modern cores have 3 integer units and 2 fp units. For both, those settings should be more realistic. I can't run any benchmarks at the moment though, as I am travelling.

LGTM - as you've said in the comment, I'd still like us to look into improving target-specific cost values instead of relying on these defaults.

This revision is now accepted and ready to land. Feb 19 2018, 6:47 AM

fhahn accepted this revision. Feb 19 2018, 6:53 AM

Rebase. The test that previously regressed in X86/insert-element-build-vector.ll does not anymore, so that's good - probably due to D42657.

Closed by commit rL325515: [TTI CostModel] change default cost of FP ops to 1 (PR36280) (authored by spatel). Feb 19 2018, 8:13 AM

This revision was automatically updated to reflect the committed changes.

Hi Sanjay,

The patch caused regressions in the LLVM benchmarks and in Spec2k/Spec2k6 benchmarks on AArch64 Cortex-A53:

SingleSource/Benchmarks/Misc/matmul_f64_4x4: 49%
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt: 5.32%
CFP2000/188.ammp/188.ammp: 3.58%
CFP2000/177.mesa/177.mesa: 2.48%
CFP2006/444.namd/444.namd: 2.49%

The regression of SingleSource/Benchmarks/Misc/matmul_f64_4x4 can also be seen on the public bot: http://lnt.llvm.org/db_default/v4/nts/90636
It is 128.85%.

The main difference in generated code is FMUL(FP, scalar) instead of FMUL(SIMD, scalar):

fmul d20, d16, d2

instead of

fmul v17.2d, v1.2d, v5.2d

This also caused code size increase: 6.04% in SingleSource/Benchmarks/Misc/matmul_f64_4x4

I am working on a reproducer.

Thanks,
Evgeny Astigeevich
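One way to see why a lower scalar FP cost can flip a vectorization decision, assuming a profitability rule of the form "vectorize when vector cost plus overhead is strictly less than the summed scalar cost". The function below and the overhead value of 1 are hypothetical and chosen only for illustration; they are not taken from the SLP vectorizer source.

```cpp
#include <cassert>

// Hypothetical model of an SLP-style profitability check.
bool worthVectorizing(unsigned NumScalarOps, unsigned ScalarOpCost,
                      unsigned VectorOpCost, unsigned OverheadCost) {
  assert(NumScalarOps > 0 && "need at least one scalar op to replace");
  int ScalarTotal = static_cast<int>(NumScalarOps * ScalarOpCost);
  int VectorTotal = static_cast<int>(VectorOpCost + OverheadCost);
  // Profitable only when the vector form is strictly cheaper.
  return VectorTotal - ScalarTotal < 0;
}
```

With FP ops costing 2, two scalar fmuls cost 4 against 2 + 1 for a two-lane vector fmul, so `worthVectorizing(2, 2, 2, 1)` is true. With FP ops costing 1, the comparison is 2 against 1 + 1, so `worthVectorizing(2, 1, 1, 1)` is false and the scalar form is kept.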

In D43079#1013269, @eastig wrote:
Hi Sanjay,

The patch caused regressions in the LLVM benchmarks and in Spec2k/Spec2k6 benchmarks on AArch64 Cortex-A53:

SingleSource/Benchmarks/Misc/matmul_f64_4x4: 49%
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt: 5.32%
CFP2000/188.ammp/188.ammp: 3.58%
CFP2000/177.mesa/177.mesa: 2.48%
CFP2006/444.namd/444.namd: 2.49%

The regression of SingleSource/Benchmarks/Misc/matmul_f64_4x4 can also be seen on the public bot: http://lnt.llvm.org/db_default/v4/nts/90636
It is 128.85%.

The main difference in generated code is FMUL(FP, scalar) instead of FMUL(SIMD, scalar):
fmul d20, d16, d2
instead of
fmul v17.2d, v1.2d, v5.2d
This also caused code size increase: 6.04% in SingleSource/Benchmarks/Misc/matmul_f64_4x4

I am working on a reproducer.

Thanks. We knew this change was likely to cause perf regressions based on some of the x86 diffs, so having those reductions will help tune the models in general and specifically for AArch64.

I.e., we should be able to solve the AArch64 problems with AArch64-specific cost model changes rather than reverting this. For example, as @fhahn mentioned, we might want to make the int-to-FP ratio 3:2 for some cores. Another possibility is overriding the fmul/fsub/fadd AArch64 costs to be more realistic (as we also probably have to do for x86).

In D43079#1013321, @spatel wrote:
Thanks. We knew this change was likely to cause perf regressions based on some of the x86 diffs, so having those reductions will help tune the models in general and specifically for AArch64.

I.e., we should be able to solve the AArch64 problems with AArch64-specific cost model changes rather than reverting this. For example, as @fhahn mentioned, we might want to make the int-to-FP ratio 3:2 for some cores. Another possibility is overriding the fmul/fsub/fadd AArch64 costs to be more realistic (as we also probably have to do for x86).

Please revert until these things get worked out so that we can properly track performance. We are seeing many regressions including 17% on 444.namd and 12% on 482.sphinx3 in SPECfp 2006.

In D43079#1013835, @anemet wrote:

Please revert until these things get worked out so that we can properly track performance. We are seeing many regressions including 17% on 444.namd and 12% on 482.sphinx3 in SPECfp 2006.

Hi Adam -

Rather than reverting for all targets, can we just hack ARM/AArch64 with patches like this (if we can add at least one test to show what changed, that would be better of course):

Index: lib/Target/ARM/ARMTargetTransformInfo.cpp
===================================================================
--- lib/Target/ARM/ARMTargetTransformInfo.cpp	(revision 325579)
+++ lib/Target/ARM/ARMTargetTransformInfo.cpp	(working copy)
@@ -514,6 +514,13 @@
   int Cost = BaseT::getArithmeticInstrCost(Opcode, Ty, Op1Info, Op2Info,
                                            Opd1PropInfo, Opd2PropInfo);
 
+  // Assume that floating point arithmetic operations cost twice as much as
+  // integer operations.
+  // FIXME: This is a win on several perf benchmarks running on CPU model ???,
+  // but there are no regression tests that show why or how this is good.
+  if (Ty->isFPOrFPVectorTy())
+    Cost *= 2;
+
   // This is somewhat of a hack. The problem that we are facing is that SROA
   // creates a sequence of shift, and, or instructions to construct values.
   // These sequences are recognized by the ISel and have zero-cost. Not so for

In D43079#1013886, @spatel wrote:
Hi Adam -

Rather than reverting for all targets, can we just hack ARM/AArch with patches like this (if we can add at least one test to show what changed, that would be better of course) :
Index: lib/Target/ARM/ARMTargetTransformInfo.cpp
===================================================================
--- lib/Target/ARM/ARMTargetTransformInfo.cpp	(revision 325579)
+++ lib/Target/ARM/ARMTargetTransformInfo.cpp	(working copy)
@@ -514,6 +514,13 @@
   int Cost = BaseT::getArithmeticInstrCost(Opcode, Ty, Op1Info, Op2Info,
                                            Opd1PropInfo, Opd2PropInfo);
 
+  // Assume that floating point arithmetic operations cost twice as much as
+  // integer operations.
+  // FIXME: This is a win on several perf benchmarks running on CPU model ???,
+  // but there are no regression tests that show why or how this is good.
+  if (Ty->isFPOrFPVectorTy())
+    Cost *= 2;
+
   // This is somewhat of a hack. The problem that we are facing is that SROA
   // creates a sequence of shift, and, or instructions to construct values.
   // These sequences are recognized by the ISel and have zero-cost. Not so for

Seeing such major swings, my preference would be to revert and put the new version up for review (I think that your hack works). Then commit the new combined version after a few days so that the perf bots get a chance to recover. What do you think?

Adam

In D43079#1013907, @anemet wrote:

Seeing such major swings, my preference would be to revert and put the new version up for review (I think that your hack works). Then commit the new combined version after a few days so that the perf bots got a chance to recover. What do you think?

Ok, this was too ambitious. Reverted at rL325658 and reopened:
https://bugs.llvm.org/show_bug.cgi?id=36280

This revision is now accepted and ready to land. Feb 20 2018, 5:49 PM

spatel planned changes to this revision. Feb 20 2018, 5:49 PM

@fhahn wrote (but it didn't transfer here):

I have not thought this through fully yet, but couldn't we use the scheduling model to get the number of units available for certain instructions for backends using the machine scheduler? And determine the throughput based on that?

Yes, I thought about that too. We already use the sched model to get instruction latency and to let the unroller know the size of the reorder buffer, so using that for throughputs and/or uops shouldn't be too hard.

But we'll have to check how exactly these costs are being used. By using a more realistic cost model, we may get into more trouble because clients may have bent the costs to match their own cost formula rather than an accurate hardware model.

Hi Sanjay,

I attached a regression test to https://bugs.llvm.org/show_bug.cgi?id=36280 .

Thanks,
Evgeny Astigeevich

In D43079#1013927, @spatel wrote:

Ok, this was too ambitious. Reverted at rL325658 and reopened:
https://bugs.llvm.org/show_bug.cgi?id=36280

Thanks! I can confirm that our perf numbers have recovered.

spatel mentioned this in rL325717: [AArch64] add SLP test for matmul (PR36280); NFC. Feb 21 2018, 12:37 PM

spatel mentioned this in D43769: [TTI] rename getArithmeticInstructionCost() to getUnitThroughput(); NFC. Feb 26 2018, 10:18 AM

spatel mentioned this in rL326217: [AArch64] add SLP test based on TSVC; NFC. Feb 27 2018, 10:08 AM

spatel mentioned this in rL326221: [ARM] add loop vectorizer test based on 482.sphinx3 from SPEC2006; NFC. Feb 27 2018, 10:36 AM

Abandoning. I think we can reach the goal of more accurate cost models with target-specific and CPU-specific steps like D46276.

Herald added a reviewer: javed.absar. May 2 2018, 3:13 PM

Revision Contents

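Path (Size)

llvm/trunk/include/llvm/CodeGen/BasicTTIImpl.h (10 lines)
llvm/trunk/test/Analysis/CostModel/X86/arith-fp.ll (324 lines)
llvm/trunk/test/Analysis/CostModel/X86/intrinsic-cost.ll (4 lines)
llvm/trunk/test/Analysis/CostModel/X86/reduction.ll (8 lines)
llvm/trunk/test/Transforms/LoopVectorize/X86/imprecise-through-phis.ll (45 lines)
llvm/trunk/test/Transforms/SLPVectorizer/AArch64/remarks.ll (2 lines)
llvm/trunk/test/Transforms/SLPVectorizer/X86/ (five files whose names were not preserved: 19, 21, 148, 59, and 12 lines)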

Diff 134925

llvm/trunk/include/llvm/CodeGen/BasicTTIImpl.h

(482 lines not shown; enclosing context: unsigned getArithmeticInstrCost()
       ArrayRef<const Value *> Args = ArrayRef<const Value *>()) {
     // Check if any of the operands are vector operands.
     const TargetLoweringBase *TLI = getTLI();
     int ISD = TLI->InstructionOpcodeToISD(Opcode);
     assert(ISD && "Invalid opcode");
 
     std::pair<unsigned, MVT> LT = TLI->getTypeLegalizationCost(DL, Ty);
 
-    bool IsFloat = Ty->isFPOrFPVectorTy();
-    // Assume that floating point arithmetic operations cost twice as much as
-    // integer operations.
-    unsigned OpCost = (IsFloat ? 2 : 1);
+    // Assume that the throughput of any integer or floating-point math
+    // operation is the same and maximal (disregarding free operations).
+    // That is, operations with less throughput should have a relative cost
+    // greater than 1. Targets should override this assumption when they can
+    // provide more accurate information.
+    unsigned OpCost = 1;
 
     if (TLI->isOperationLegalOrPromote(ISD, LT.second)) {
       // The operation is legal. Assume it costs 1.
       // TODO: Once we have extract/insert subvector cost we need to use them.
       return LT.first * OpCost;
     }
 
     if (!TLI->isOperationExpand(ISD, LT.second)) {
(804 lines not shown)
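The `LT.first * OpCost` product explains why the expected costs of wide vectors scale with the register width in the test updates of this revision. A minimal sketch, assuming that legalization simply splits a vector into register-sized pieces; `legalizedParts` and `defaultArithCost` are hypothetical helpers, and real type legalization handles many more cases.

```cpp
#include <cassert>

// Hypothetical simplification of type legalization: the number of
// register-sized pieces a vector of VectorBits is split into.
unsigned legalizedParts(unsigned VectorBits, unsigned RegisterBits) {
  assert(VectorBits > 0 && RegisterBits > 0);
  return (VectorBits + RegisterBits - 1) / RegisterBits; // ceiling division
}

// Default cost of a legal arithmetic op: pieces times the per-op cost.
unsigned defaultArithCost(unsigned VectorBits, unsigned RegisterBits,
                          unsigned OpCost) {
  return legalizedParts(VectorBits, RegisterBits) * OpCost;
}
```

Under that assumption, a 256-bit `<8 x float>` fadd with 128-bit SSE2 registers is two pieces, so its cost drops from `defaultArithCost(256, 128, 2)`, which is 4, to `defaultArithCost(256, 128, 1)`, which is 2. That matches the SSE2 expectation for `%V8F32` in arith-fp.ll.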

llvm/trunk/test/Analysis/CostModel/X86/arith-fp.ll

 ; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+sse2 | FileCheck %s --check-prefix=CHECK --check-prefix=SSE2
 ; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+sse4.2 | FileCheck %s --check-prefix=CHECK --check-prefix=SSE42
 ; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx,+fma | FileCheck %s --check-prefix=CHECK --check-prefix=AVX
 ; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx2,+fma | FileCheck %s --check-prefix=CHECK --check-prefix=AVX2
 ; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx512f | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512F
 ; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx512f,+avx512bw | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512BW
 
 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
 target triple = "x86_64-apple-macosx10.8.0"

 ; CHECK-LABEL: 'fadd'
 define i32 @fadd(i32 %arg) {
-  ; SSE2: cost of 2 {{.*}} %F32 = fadd
-  ; SSE42: cost of 2 {{.*}} %F32 = fadd
-  ; AVX: cost of 2 {{.*}} %F32 = fadd
-  ; AVX2: cost of 2 {{.*}} %F32 = fadd
-  ; AVX512: cost of 2 {{.*}} %F32 = fadd
+  ; SSE2: cost of 1 {{.*}} %F32 = fadd
+  ; SSE42: cost of 1 {{.*}} %F32 = fadd
+  ; AVX: cost of 1 {{.*}} %F32 = fadd
+  ; AVX2: cost of 1 {{.*}} %F32 = fadd
+  ; AVX512: cost of 1 {{.*}} %F32 = fadd
   %F32 = fadd float undef, undef
-  ; SSE2: cost of 2 {{.*}} %V4F32 = fadd
-  ; SSE42: cost of 2 {{.*}} %V4F32 = fadd
-  ; AVX: cost of 2 {{.*}} %V4F32 = fadd
-  ; AVX2: cost of 2 {{.*}} %V4F32 = fadd
-  ; AVX512: cost of 2 {{.*}} %V4F32 = fadd
+  ; SSE2: cost of 1 {{.*}} %V4F32 = fadd
+  ; SSE42: cost of 1 {{.*}} %V4F32 = fadd
+  ; AVX: cost of 1 {{.*}} %V4F32 = fadd
+  ; AVX2: cost of 1 {{.*}} %V4F32 = fadd
+  ; AVX512: cost of 1 {{.*}} %V4F32 = fadd
   %V4F32 = fadd <4 x float> undef, undef
-  ; SSE2: cost of 4 {{.*}} %V8F32 = fadd
-  ; SSE42: cost of 4 {{.*}} %V8F32 = fadd
-  ; AVX: cost of 2 {{.*}} %V8F32 = fadd
-  ; AVX2: cost of 2 {{.*}} %V8F32 = fadd
-  ; AVX512: cost of 2 {{.*}} %V8F32 = fadd
+  ; SSE2: cost of 2 {{.*}} %V8F32 = fadd
+  ; SSE42: cost of 2 {{.*}} %V8F32 = fadd
+  ; AVX: cost of 1 {{.*}} %V8F32 = fadd
+  ; AVX2: cost of 1 {{.*}} %V8F32 = fadd
+  ; AVX512: cost of 1 {{.*}} %V8F32 = fadd
   %V8F32 = fadd <8 x float> undef, undef
-  ; SSE2: cost of 8 {{.*}} %V16F32 = fadd
-  ; SSE42: cost of 8 {{.*}} %V16F32 = fadd
-  ; AVX: cost of 4 {{.*}} %V16F32 = fadd
-  ; AVX2: cost of 4 {{.*}} %V16F32 = fadd
-  ; AVX512: cost of 2 {{.*}} %V16F32 = fadd
+  ; SSE2: cost of 4 {{.*}} %V16F32 = fadd
+  ; SSE42: cost of 4 {{.*}} %V16F32 = fadd
+  ; AVX: cost of 2 {{.*}} %V16F32 = fadd
+  ; AVX2: cost of 2 {{.*}} %V16F32 = fadd
+  ; AVX512: cost of 1 {{.*}} %V16F32 = fadd
   %V16F32 = fadd <16 x float> undef, undef
 
-  ; SSE2: cost of 2 {{.*}} %F64 = fadd
-  ; SSE42: cost of 2 {{.*}} %F64 = fadd
-  ; AVX: cost of 2 {{.*}} %F64 = fadd
-  ; AVX2: cost of 2 {{.*}} %F64 = fadd
-  ; AVX512: cost of 2 {{.*}} %F64 = fadd
+  ; SSE2: cost of 1 {{.*}} %F64 = fadd
+  ; SSE42: cost of 1 {{.*}} %F64 = fadd
+  ; AVX: cost of 1 {{.*}} %F64 = fadd
+  ; AVX2: cost of 1 {{.*}} %F64 = fadd
+  ; AVX512: cost of 1 {{.*}} %F64 = fadd
   %F64 = fadd double undef, undef
-  ; SSE2: cost of 2 {{.*}} %V2F64 = fadd
-  ; SSE42: cost of 2 {{.*}} %V2F64 = fadd
-  ; AVX: cost of 2 {{.*}} %V2F64 = fadd
-  ; AVX2: cost of 2 {{.*}} %V2F64 = fadd
-  ; AVX512: cost of 2 {{.*}} %V2F64 = fadd
+  ; SSE2: cost of 1 {{.*}} %V2F64 = fadd
+  ; SSE42: cost of 1 {{.*}} %V2F64 = fadd
+  ; AVX: cost of 1 {{.*}} %V2F64 = fadd
+  ; AVX2: cost of 1 {{.*}} %V2F64 = fadd
+  ; AVX512: cost of 1 {{.*}} %V2F64 = fadd
   %V2F64 = fadd <2 x double> undef, undef
-  ; SSE2: cost of 4 {{.*}} %V4F64 = fadd
-  ; SSE42: cost of 4 {{.*}} %V4F64 = fadd
-  ; AVX: cost of 2 {{.*}} %V4F64 = fadd
-  ; AVX2: cost of 2 {{.*}} %V4F64 = fadd
-  ; AVX512: cost of 2 {{.*}} %V4F64 = fadd
+  ; SSE2: cost of 2 {{.*}} %V4F64 = fadd
+  ; SSE42: cost of 2 {{.*}} %V4F64 = fadd
+  ; AVX: cost of 1 {{.*}} %V4F64 = fadd
+  ; AVX2: cost of 1 {{.*}} %V4F64 = fadd
+  ; AVX512: cost of 1 {{.*}} %V4F64 = fadd
   %V4F64 = fadd <4 x double> undef, undef
-  ; SSE2: cost of 8 {{.*}} %V8F64 = fadd
-  ; SSE42: cost of 8 {{.*}} %V8F64 = fadd
-  ; AVX: cost of 4 {{.*}} %V8F64 = fadd
-  ; AVX2: cost of 4 {{.*}} %V8F64 = fadd
-  ; AVX512: cost of 2 {{.*}} %V8F64 = fadd
+  ; SSE2: cost of 4 {{.*}} %V8F64 = fadd
+  ; SSE42: cost of 4 {{.*}} %V8F64 = fadd
+  ; AVX: cost of 2 {{.*}} %V8F64 = fadd
+  ; AVX2: cost of 2 {{.*}} %V8F64 = fadd
+  ; AVX512: cost of 1 {{.*}} %V8F64 = fadd
   %V8F64 = fadd <8 x double> undef, undef
 
   ret i32 undef
 }

 ; CHECK-LABEL: 'fsub'
 define i32 @fsub(i32 %arg) {
-  ; SSE2: cost of 2 {{.*}} %F32 = fsub
-  ; SSE42: cost of 2 {{.*}} %F32 = fsub
-  ; AVX: cost of 2 {{.*}} %F32 = fsub
-  ; AVX2: cost of 2 {{.*}} %F32 = fsub
-  ; AVX512: cost of 2 {{.*}} %F32 = fsub
+  ; SSE2: cost of 1 {{.*}} %F32 = fsub
+  ; SSE42: cost of 1 {{.*}} %F32 = fsub
+  ; AVX: cost of 1 {{.*}} %F32 = fsub
+  ; AVX2: cost of 1 {{.*}} %F32 = fsub
+  ; AVX512: cost of 1 {{.*}} %F32 = fsub
   %F32 = fsub float undef, undef
-  ; SSE2: cost of 2 {{.*}} %V4F32 = fsub
-  ; SSE42: cost of 2 {{.*}} %V4F32 = fsub
-  ; AVX: cost of 2 {{.*}} %V4F32 = fsub
-  ; AVX2: cost of 2 {{.*}} %V4F32 = fsub
-  ; AVX512: cost of 2 {{.*}} %V4F32 = fsub
+  ; SSE2: cost of 1 {{.*}} %V4F32 = fsub
+  ; SSE42: cost of 1 {{.*}} %V4F32 = fsub
+  ; AVX: cost of 1 {{.*}} %V4F32 = fsub
+  ; AVX2: cost of 1 {{.*}} %V4F32 = fsub
+  ; AVX512: cost of 1 {{.*}} %V4F32 = fsub
   %V4F32 = fsub <4 x float> undef, undef
-  ; SSE2: cost of 4 {{.*}} %V8F32 = fsub
-  ; SSE42: cost of 4 {{.*}} %V8F32 = fsub
-  ; AVX: cost of 2 {{.*}} %V8F32 = fsub
-  ; AVX2: cost of 2 {{.*}} %V8F32 = fsub
-  ; AVX512: cost of 2 {{.*}} %V8F32 = fsub
+  ; SSE2: cost of 2 {{.*}} %V8F32 = fsub
+  ; SSE42: cost of 2 {{.*}} %V8F32 = fsub
+  ; AVX: cost of 1 {{.*}} %V8F32 = fsub
+  ; AVX2: cost of 1 {{.*}} %V8F32 = fsub
+  ; AVX512: cost of 1 {{.*}} %V8F32 = fsub
   %V8F32 = fsub <8 x float> undef, undef
-  ; SSE2: cost of 8 {{.*}} %V16F32 = fsub
-  ; SSE42: cost of 8 {{.*}} %V16F32 = fsub
-  ; AVX: cost of 4 {{.*}} %V16F32 = fsub
-  ; AVX2: cost of 4 {{.*}} %V16F32 = fsub
-  ; AVX512: cost of 2 {{.*}} %V16F32 = fsub
+  ; SSE2: cost of 4 {{.*}} %V16F32 = fsub
+  ; SSE42: cost of 4 {{.*}} %V16F32 = fsub
+  ; AVX: cost of 2 {{.*}} %V16F32 = fsub
+  ; AVX2: cost of 2 {{.*}} %V16F32 = fsub
+  ; AVX512: cost of 1 {{.*}} %V16F32 = fsub
   %V16F32 = fsub <16 x float> undef, undef
 
-  ; SSE2: cost of 2 {{.*}} %F64 = fsub
-  ; SSE42: cost of 2 {{.*}} %F64 = fsub
-  ; AVX: cost of 2 {{.*}} %F64 = fsub
-  ; AVX2: cost of 2 {{.*}} %F64 = fsub
-  ; AVX512: cost of 2 {{.*}} %F64 = fsub
+  ; SSE2: cost of 1 {{.*}} %F64 = fsub
+  ; SSE42: cost of 1 {{.*}} %F64 = fsub
+  ; AVX: cost of 1 {{.*}} %F64 = fsub
+  ; AVX2: cost of 1 {{.*}} %F64 = fsub
+  ; AVX512: cost of 1 {{.*}} %F64 = fsub
   %F64 = fsub double undef, undef
-  ; SSE2: cost of 2 {{.*}} %V2F64 = fsub
-  ; SSE42: cost of 2 {{.*}} %V2F64 = fsub
-  ; AVX: cost of 2 {{.*}} %V2F64 = fsub
-  ; AVX2: cost of 2 {{.*}} %V2F64 = fsub
-  ; AVX512: cost of 2 {{.*}} %V2F64 = fsub
+  ; SSE2: cost of 1 {{.*}} %V2F64 = fsub
+  ; SSE42: cost of 1 {{.*}} %V2F64 = fsub
+  ; AVX: cost of 1 {{.*}} %V2F64 = fsub
+  ; AVX2: cost of 1 {{.*}} %V2F64 = fsub
+  ; AVX512: cost of 1 {{.*}} %V2F64 = fsub
   %V2F64 = fsub <2 x double> undef, undef
-  ; SSE2: cost of 4 {{.*}} %V4F64 = fsub
-  ; SSE42: cost of 4 {{.*}} %V4F64 = fsub
-  ; AVX: cost of 2 {{.*}} %V4F64 = fsub
-  ; AVX2: cost of 2 {{.*}} %V4F64 = fsub
-  ; AVX512: cost of 2 {{.*}} %V4F64 = fsub
+  ; SSE2: cost of 2 {{.*}} %V4F64 = fsub
+  ; SSE42: cost of 2 {{.*}} %V4F64 = fsub
+  ; AVX: cost of 1 {{.*}} %V4F64 = fsub
+  ; AVX2: cost of 1 {{.*}} %V4F64 = fsub
+  ; AVX512: cost of 1 {{.*}} %V4F64 = fsub
   %V4F64 = fsub <4 x double> undef, undef
-  ; SSE2: cost of 8 {{.*}} %V8F64 = fsub
-  ; SSE42: cost of 8 {{.*}} %V8F64 = fsub
-  ; AVX: cost of 4 {{.*}} %V8F64 = fsub
-  ; AVX2: cost of 4 {{.*}} %V8F64 = fsub
-  ; AVX512: cost of 2 {{.*}} %V8F64 = fsub
+  ; SSE2: cost of 4 {{.*}} %V8F64 = fsub
+  ; SSE42: cost of 4 {{.*}} %V8F64 = fsub
+  ; AVX: cost of 2 {{.*}} %V8F64 = fsub
+  ; AVX2: cost of 2 {{.*}} %V8F64 = fsub
+  ; AVX512: cost of 1 {{.*}} %V8F64 = fsub
   %V8F64 = fsub <8 x double> undef, undef
 
   ret i32 undef
 }

 ; CHECK-LABEL: 'fmul'
 define i32 @fmul(i32 %arg) {
-  ; SSE2: cost of 2 {{.*}} %F32 = fmul
-  ; SSE42: cost of 2 {{.*}} %F32 = fmul
-  ; AVX: cost of 2 {{.*}} %F32 = fmul
-  ; AVX2: cost of 2 {{.*}} %F32 = fmul
-  ; AVX512: cost of 2 {{.*}} %F32 = fmul
+  ; SSE2: cost of 1 {{.*}} %F32 = fmul
+  ; SSE42: cost of 1 {{.*}} %F32 = fmul
+  ; AVX: cost of 1 {{.*}} %F32 = fmul
+  ; AVX2: cost of 1 {{.*}} %F32 = fmul
+  ; AVX512: cost of 1 {{.*}} %F32 = fmul
   %F32 = fmul float undef, undef
-  ; SSE2: cost of 2 {{.*}} %V4F32 = fmul
-  ; SSE42: cost of 2 {{.*}} %V4F32 = fmul
-  ; AVX: cost of 2 {{.*}} %V4F32 = fmul
-  ; AVX2: cost of 2 {{.*}} %V4F32 = fmul
-  ; AVX512: cost of 2 {{.*}} %V4F32 = fmul
+  ; SSE2: cost of 1 {{.*}} %V4F32 = fmul
+  ; SSE42: cost of 1 {{.*}} %V4F32 = fmul
+  ; AVX: cost of 1 {{.*}} %V4F32 = fmul
+  ; AVX2: cost of 1 {{.*}} %V4F32 = fmul
+  ; AVX512: cost of 1 {{.*}} %V4F32 = fmul
   %V4F32 = fmul <4 x float> undef, undef
-  ; SSE2: cost of 4 {{.*}} %V8F32 = fmul
-  ; SSE42: cost of 4 {{.*}} %V8F32 = fmul
-  ; AVX: cost of 2 {{.*}} %V8F32 = fmul
-  ; AVX2: cost of 2 {{.*}} %V8F32 = fmul
-  ; AVX512: cost of 2 {{.*}} %V8F32 = fmul
+  ; SSE2: cost of 2 {{.*}} %V8F32 = fmul
+  ; SSE42: cost of 2 {{.*}} %V8F32 = fmul
+  ; AVX: cost of 1 {{.*}} %V8F32 = fmul
+  ; AVX2: cost of 1 {{.*}} %V8F32 = fmul
+  ; AVX512: cost of 1 {{.*}} %V8F32 = fmul
   %V8F32 = fmul <8 x float> undef, undef
-  ; SSE2: cost of 8 {{.*}} %V16F32 = fmul
-  ; SSE42: cost of 8 {{.*}} %V16F32 = fmul
-  ; AVX: cost of 4 {{.*}} %V16F32 = fmul
-  ; AVX2: cost of 4 {{.*}} %V16F32 = fmul
-  ; AVX512: cost of 2 {{.*}} %V16F32 = fmul
+  ; SSE2: cost of 4 {{.*}} %V16F32 = fmul
+  ; SSE42: cost of 4 {{.*}} %V16F32 = fmul
+  ; AVX: cost of 2 {{.*}} %V16F32 = fmul
+  ; AVX2: cost of 2 {{.*}} %V16F32 = fmul
+  ; AVX512: cost of 1 {{.*}} %V16F32 = fmul
   %V16F32 = fmul <16 x float> undef, undef
 
-  ; SSE2: cost of 2 {{.*}} %F64 = fmul
-  ; SSE42: cost of 2 {{.*}} %F64 = fmul
-  ; AVX: cost of 2 {{.*}} %F64 = fmul
-  ; AVX2: cost of 2 {{.*}} %F64 = fmul
-  ; AVX512: cost of 2 {{.*}} %F64 = fmul
+  ; SSE2: cost of 1 {{.*}} %F64 = fmul
+  ; SSE42: cost of 1 {{.*}} %F64 = fmul
+  ; AVX: cost of 1 {{.*}} %F64 = fmul
+  ; AVX2: cost of 1 {{.*}} %F64 = fmul
+  ; AVX512: cost of 1 {{.*}} %F64 = fmul
   %F64 = fmul double undef, undef
-  ; SSE2: cost of 2 {{.*}} %V2F64 = fmul
-  ; SSE42: cost of 2 {{.*}} %V2F64 = fmul
-  ; AVX: cost of 2 {{.*}} %V2F64 = fmul
-  ; AVX2: cost of 2 {{.*}} %V2F64 = fmul
-  ; AVX512: cost of 2 {{.*}} %V2F64 = fmul
+  ; SSE2: cost of 1 {{.*}} %V2F64 = fmul
+  ; SSE42: cost of 1 {{.*}} %V2F64 = fmul
+  ; AVX: cost of 1 {{.*}} %V2F64 = fmul
+  ; AVX2: cost of 1 {{.*}} %V2F64 = fmul
+  ; AVX512: cost of 1 {{.*}} %V2F64 = fmul
   %V2F64 = fmul <2 x double> undef, undef
-  ; SSE2: cost of 4 {{.*}} %V4F64 = fmul
-  ; SSE42: cost of 4 {{.*}} %V4F64 = fmul
-  ; AVX: cost of 2 {{.*}} %V4F64 = fmul
-  ; AVX2: cost of 2 {{.*}} %V4F64 = fmul
-  ; AVX512: cost of 2 {{.*}} %V4F64 = fmul
+  ; SSE2: cost of 2 {{.*}} %V4F64 = fmul
+  ; SSE42: cost of 2 {{.*}} %V4F64 = fmul
+  ; AVX: cost of 1 {{.*}} %V4F64 = fmul
+  ; AVX2: cost of 1 {{.*}} %V4F64 = fmul
+  ; AVX512: cost of 1 {{.*}} %V4F64 = fmul
   %V4F64 = fmul <4 x double> undef, undef
-  ; SSE2: cost of 8 {{.*}} %V8F64 = fmul
-  ; SSE42: cost of 8 {{.*}} %V8F64 = fmul
-  ; AVX: cost of 4 {{.*}} %V8F64 = fmul
-  ; AVX2: cost of 4 {{.*}} %V8F64 = fmul
-  ; AVX512: cost of 2 {{.*}} %V8F64 = fmul
+  ; SSE2: cost of 4 {{.*}} %V8F64 = fmul
+  ; SSE42: cost of 4 {{.*}} %V8F64 = fmul
+  ; AVX: cost of 2 {{.*}} %V8F64 = fmul
+  ; AVX2: cost of 2 {{.*}} %V8F64 = fmul
+  ; AVX512: cost of 1 {{.*}} %V8F64 = fmul
   %V8F64 = fmul <8 x double> undef, undef
 
   ret i32 undef
 }

; CHECK-LABEL: 'fdiv'		; CHECK-LABEL: 'fdiv'
define i32 @fdiv(i32 %arg) {		define i32 @fdiv(i32 %arg) {
; SSE2: cost of 23 {{.*}} %F32 = fdiv		; SSE2: cost of 23 {{.*}} %F32 = fdiv
[... 13 lines not shown ...]	define i32 @fdiv(i32 %arg) {
; AVX: cost of 28 {{.*}} %V8F32 = fdiv		; AVX: cost of 28 {{.*}} %V8F32 = fdiv
; AVX2: cost of 14 {{.*}} %V8F32 = fdiv		; AVX2: cost of 14 {{.*}} %V8F32 = fdiv
; AVX512: cost of 14 {{.*}} %V8F32 = fdiv		; AVX512: cost of 14 {{.*}} %V8F32 = fdiv
%V8F32 = fdiv <8 x float> undef, undef		%V8F32 = fdiv <8 x float> undef, undef
; SSE2: cost of 156 {{.*}} %V16F32 = fdiv		; SSE2: cost of 156 {{.*}} %V16F32 = fdiv
; SSE42: cost of 56 {{.*}} %V16F32 = fdiv		; SSE42: cost of 56 {{.*}} %V16F32 = fdiv
; AVX: cost of 56 {{.*}} %V16F32 = fdiv		; AVX: cost of 56 {{.*}} %V16F32 = fdiv
; AVX2: cost of 28 {{.*}} %V16F32 = fdiv		; AVX2: cost of 28 {{.*}} %V16F32 = fdiv
; AVX512: cost of 2 {{.*}} %V16F32 = fdiv		; AVX512: cost of 1 {{.*}} %V16F32 = fdiv
%V16F32 = fdiv <16 x float> undef, undef		%V16F32 = fdiv <16 x float> undef, undef

; SSE2: cost of 38 {{.*}} %F64 = fdiv		; SSE2: cost of 38 {{.*}} %F64 = fdiv
; SSE42: cost of 22 {{.*}} %F64 = fdiv		; SSE42: cost of 22 {{.*}} %F64 = fdiv
; AVX: cost of 22 {{.*}} %F64 = fdiv		; AVX: cost of 22 {{.*}} %F64 = fdiv
; AVX2: cost of 14 {{.*}} %F64 = fdiv		; AVX2: cost of 14 {{.*}} %F64 = fdiv
; AVX512: cost of 14 {{.*}} %F64 = fdiv		; AVX512: cost of 14 {{.*}} %F64 = fdiv
%F64 = fdiv double undef, undef		%F64 = fdiv double undef, undef
; SSE2: cost of 69 {{.*}} %V2F64 = fdiv		; SSE2: cost of 69 {{.*}} %V2F64 = fdiv
; SSE42: cost of 22 {{.*}} %V2F64 = fdiv		; SSE42: cost of 22 {{.*}} %V2F64 = fdiv
; AVX: cost of 22 {{.*}} %V2F64 = fdiv		; AVX: cost of 22 {{.*}} %V2F64 = fdiv
; AVX2: cost of 14 {{.*}} %V2F64 = fdiv		; AVX2: cost of 14 {{.*}} %V2F64 = fdiv
; AVX512: cost of 14 {{.*}} %V2F64 = fdiv		; AVX512: cost of 14 {{.*}} %V2F64 = fdiv
%V2F64 = fdiv <2 x double> undef, undef		%V2F64 = fdiv <2 x double> undef, undef
; SSE2: cost of 138 {{.*}} %V4F64 = fdiv		; SSE2: cost of 138 {{.*}} %V4F64 = fdiv
; SSE42: cost of 44 {{.*}} %V4F64 = fdiv		; SSE42: cost of 44 {{.*}} %V4F64 = fdiv
; AVX: cost of 44 {{.*}} %V4F64 = fdiv		; AVX: cost of 44 {{.*}} %V4F64 = fdiv
; AVX2: cost of 28 {{.*}} %V4F64 = fdiv		; AVX2: cost of 28 {{.*}} %V4F64 = fdiv
; AVX512: cost of 28 {{.*}} %V4F64 = fdiv		; AVX512: cost of 28 {{.*}} %V4F64 = fdiv
%V4F64 = fdiv <4 x double> undef, undef		%V4F64 = fdiv <4 x double> undef, undef
; SSE2: cost of 276 {{.*}} %V8F64 = fdiv		; SSE2: cost of 276 {{.*}} %V8F64 = fdiv
; SSE42: cost of 88 {{.*}} %V8F64 = fdiv		; SSE42: cost of 88 {{.*}} %V8F64 = fdiv
; AVX: cost of 88 {{.*}} %V8F64 = fdiv		; AVX: cost of 88 {{.*}} %V8F64 = fdiv
; AVX2: cost of 56 {{.*}} %V8F64 = fdiv		; AVX2: cost of 56 {{.*}} %V8F64 = fdiv
; AVX512: cost of 2 {{.*}} %V8F64 = fdiv		; AVX512: cost of 1 {{.*}} %V8F64 = fdiv
%V8F64 = fdiv <8 x double> undef, undef		%V8F64 = fdiv <8 x double> undef, undef

ret i32 undef		ret i32 undef
}		}

; CHECK-LABEL: 'frem'		; CHECK-LABEL: 'frem'
define i32 @frem(i32 %arg) {		define i32 @frem(i32 %arg) {
; SSE2: cost of 2 {{.*}} %F32 = frem		; SSE2: cost of 1 {{.*}} %F32 = frem
; SSE42: cost of 2 {{.*}} %F32 = frem		; SSE42: cost of 1 {{.*}} %F32 = frem
; AVX: cost of 2 {{.*}} %F32 = frem		; AVX: cost of 1 {{.*}} %F32 = frem
; AVX2: cost of 2 {{.*}} %F32 = frem		; AVX2: cost of 1 {{.*}} %F32 = frem
; AVX512: cost of 2 {{.*}} %F32 = frem		; AVX512: cost of 1 {{.*}} %F32 = frem
%F32 = frem float undef, undef		%F32 = frem float undef, undef
; SSE2: cost of 14 {{.*}} %V4F32 = frem		; SSE2: cost of 10 {{.*}} %V4F32 = frem
; SSE42: cost of 14 {{.*}} %V4F32 = frem		; SSE42: cost of 10 {{.*}} %V4F32 = frem
; AVX: cost of 14 {{.*}} %V4F32 = frem		; AVX: cost of 10 {{.*}} %V4F32 = frem
; AVX2: cost of 14 {{.*}} %V4F32 = frem		; AVX2: cost of 10 {{.*}} %V4F32 = frem
; AVX512: cost of 14 {{.*}} %V4F32 = frem		; AVX512: cost of 10 {{.*}} %V4F32 = frem
%V4F32 = frem <4 x float> undef, undef		%V4F32 = frem <4 x float> undef, undef
; SSE2: cost of 28 {{.*}} %V8F32 = frem		; SSE2: cost of 20 {{.*}} %V8F32 = frem
; SSE42: cost of 28 {{.*}} %V8F32 = frem		; SSE42: cost of 20 {{.*}} %V8F32 = frem
; AVX: cost of 30 {{.*}} %V8F32 = frem		; AVX: cost of 22 {{.*}} %V8F32 = frem
; AVX2: cost of 30 {{.*}} %V8F32 = frem		; AVX2: cost of 22 {{.*}} %V8F32 = frem
; AVX512: cost of 30 {{.*}} %V8F32 = frem		; AVX512: cost of 22 {{.*}} %V8F32 = frem
%V8F32 = frem <8 x float> undef, undef		%V8F32 = frem <8 x float> undef, undef
; SSE2: cost of 56 {{.*}} %V16F32 = frem		; SSE2: cost of 40 {{.*}} %V16F32 = frem
; SSE42: cost of 56 {{.*}} %V16F32 = frem		; SSE42: cost of 40 {{.*}} %V16F32 = frem
; AVX: cost of 60 {{.*}} %V16F32 = frem		; AVX: cost of 44 {{.*}} %V16F32 = frem
; AVX2: cost of 60 {{.*}} %V16F32 = frem		; AVX2: cost of 44 {{.*}} %V16F32 = frem
; AVX512: cost of 62 {{.*}} %V16F32 = frem		; AVX512: cost of 46 {{.*}} %V16F32 = frem
%V16F32 = frem <16 x float> undef, undef		%V16F32 = frem <16 x float> undef, undef

; SSE2: cost of 2 {{.*}} %F64 = frem		; SSE2: cost of 1 {{.*}} %F64 = frem
; SSE42: cost of 2 {{.*}} %F64 = frem		; SSE42: cost of 1 {{.*}} %F64 = frem
; AVX: cost of 2 {{.*}} %F64 = frem		; AVX: cost of 1 {{.*}} %F64 = frem
; AVX2: cost of 2 {{.*}} %F64 = frem		; AVX2: cost of 1 {{.*}} %F64 = frem
; AVX512: cost of 2 {{.*}} %F64 = frem		; AVX512: cost of 1 {{.*}} %F64 = frem
%F64 = frem double undef, undef		%F64 = frem double undef, undef
; SSE2: cost of 6 {{.*}} %V2F64 = frem		; SSE2: cost of 4 {{.*}} %V2F64 = frem
; SSE42: cost of 6 {{.*}} %V2F64 = frem		; SSE42: cost of 4 {{.*}} %V2F64 = frem
; AVX: cost of 6 {{.*}} %V2F64 = frem		; AVX: cost of 4 {{.*}} %V2F64 = frem
; AVX2: cost of 6 {{.*}} %V2F64 = frem		; AVX2: cost of 4 {{.*}} %V2F64 = frem
; AVX512: cost of 6 {{.*}} %V2F64 = frem		; AVX512: cost of 4 {{.*}} %V2F64 = frem
%V2F64 = frem <2 x double> undef, undef		%V2F64 = frem <2 x double> undef, undef
; SSE2: cost of 12 {{.*}} %V4F64 = frem		; SSE2: cost of 8 {{.*}} %V4F64 = frem
; SSE42: cost of 12 {{.*}} %V4F64 = frem		; SSE42: cost of 8 {{.*}} %V4F64 = frem
; AVX: cost of 14 {{.*}} %V4F64 = frem		; AVX: cost of 10 {{.*}} %V4F64 = frem
; AVX2: cost of 14 {{.*}} %V4F64 = frem		; AVX2: cost of 10 {{.*}} %V4F64 = frem
; AVX512: cost of 14 {{.*}} %V4F64 = frem		; AVX512: cost of 10 {{.*}} %V4F64 = frem
%V4F64 = frem <4 x double> undef, undef		%V4F64 = frem <4 x double> undef, undef
; SSE2: cost of 24 {{.*}} %V8F64 = frem		; SSE2: cost of 16 {{.*}} %V8F64 = frem
; SSE42: cost of 24 {{.*}} %V8F64 = frem		; SSE42: cost of 16 {{.*}} %V8F64 = frem
; AVX: cost of 28 {{.*}} %V8F64 = frem		; AVX: cost of 20 {{.*}} %V8F64 = frem
; AVX2: cost of 28 {{.*}} %V8F64 = frem		; AVX2: cost of 20 {{.*}} %V8F64 = frem
; AVX512: cost of 30 {{.*}} %V8F64 = frem		; AVX512: cost of 22 {{.*}} %V8F64 = frem
%V8F64 = frem <8 x double> undef, undef		%V8F64 = frem <8 x double> undef, undef

ret i32 undef		ret i32 undef
}		}

; CHECK-LABEL: 'fsqrt'		; CHECK-LABEL: 'fsqrt'
define i32 @fsqrt(i32 %arg) {		define i32 @fsqrt(i32 %arg) {
; SSE2: cost of 28 {{.*}} %F32 = call float @llvm.sqrt.f32		; SSE2: cost of 28 {{.*}} %F32 = call float @llvm.sqrt.f32
[... 256 lines not shown ...]

llvm/trunk/test/Analysis/CostModel/X86/intrinsic-cost.ll

[... 72 lines not shown ...]	vector.body: ; preds = %vector.body, %vector.ph
%index.next = add i64 %index, 4		%index.next = add i64 %index, 4
%3 = icmp eq i64 %index.next, 1024		%3 = icmp eq i64 %index.next, 1024
br i1 %3, label %for.end, label %vector.body		br i1 %3, label %for.end, label %vector.body

for.end: ; preds = %vector.body		for.end: ; preds = %vector.body
ret void		ret void

; CORE2: Printing analysis 'Cost Model Analysis' for function 'test3':		; CORE2: Printing analysis 'Cost Model Analysis' for function 'test3':
; CORE2: Cost Model: Found an estimated cost of 4 for instruction: %2 = call <4 x float> @llvm.fmuladd.v4f32(<4 x float> %wide.load, <4 x float> %b, <4 x float> %c)		; CORE2: Cost Model: Found an estimated cost of 2 for instruction: %2 = call <4 x float> @llvm.fmuladd.v4f32(<4 x float> %wide.load, <4 x float> %b, <4 x float> %c)

; COREI7: Printing analysis 'Cost Model Analysis' for function 'test3':		; COREI7: Printing analysis 'Cost Model Analysis' for function 'test3':
; COREI7: Cost Model: Found an estimated cost of 4 for instruction: %2 = call <4 x float> @llvm.fmuladd.v4f32(<4 x float> %wide.load, <4 x float> %b, <4 x float> %c)		; COREI7: Cost Model: Found an estimated cost of 2 for instruction: %2 = call <4 x float> @llvm.fmuladd.v4f32(<4 x float> %wide.load, <4 x float> %b, <4 x float> %c)

}		}

declare <4 x float> @llvm.fmuladd.v4f32(<4 x float>, <4 x float>, <4 x float>) nounwind readnone		declare <4 x float> @llvm.fmuladd.v4f32(<4 x float>, <4 x float>, <4 x float>) nounwind readnone

llvm/trunk/test/Analysis/CostModel/X86/reduction.ll

; RUN: opt < %s -cost-model -costmodel-reduxcost=true -analyze -mcpu=core2 -mtriple=x86_64-apple-darwin | FileCheck %s		; RUN: opt < %s -cost-model -costmodel-reduxcost=true -analyze -mcpu=core2 -mtriple=x86_64-apple-darwin | FileCheck %s
; RUN: opt < %s -cost-model -costmodel-reduxcost=true -analyze -mcpu=corei7 -mtriple=x86_64-apple-darwin | FileCheck %s --check-prefix=SSE3		; RUN: opt < %s -cost-model -costmodel-reduxcost=true -analyze -mcpu=corei7 -mtriple=x86_64-apple-darwin | FileCheck %s --check-prefix=SSE3
; RUN: opt < %s -cost-model -costmodel-reduxcost=true -analyze -mcpu=corei7-avx -mtriple=x86_64-apple-darwin | FileCheck %s --check-prefix=AVX		; RUN: opt < %s -cost-model -costmodel-reduxcost=true -analyze -mcpu=corei7-avx -mtriple=x86_64-apple-darwin | FileCheck %s --check-prefix=AVX
; RUN: opt < %s -cost-model -costmodel-reduxcost=true -analyze -mcpu=core-avx2 -mtriple=x86_64-apple-darwin | FileCheck %s --check-prefix=AVX2		; RUN: opt < %s -cost-model -costmodel-reduxcost=true -analyze -mcpu=core-avx2 -mtriple=x86_64-apple-darwin | FileCheck %s --check-prefix=AVX2

define fastcc float @reduction_cost_float(<4 x float> %rdx) {		define fastcc float @reduction_cost_float(<4 x float> %rdx) {
%rdx.shuf = shufflevector <4 x float> %rdx, <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>		%rdx.shuf = shufflevector <4 x float> %rdx, <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
%bin.rdx = fadd <4 x float> %rdx, %rdx.shuf		%bin.rdx = fadd <4 x float> %rdx, %rdx.shuf
%rdx.shuf7 = shufflevector <4 x float> %bin.rdx, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>		%rdx.shuf7 = shufflevector <4 x float> %bin.rdx, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
%bin.rdx8 = fadd <4 x float> %bin.rdx, %rdx.shuf7		%bin.rdx8 = fadd <4 x float> %bin.rdx, %rdx.shuf7

; Check that we recognize the tree starting at the extractelement as a		; Check that we recognize the tree starting at the extractelement as a
; reduction.		; reduction.
; CHECK-LABEL: reduction_cost		; CHECK-LABEL: reduction_cost_float
; CHECK: cost of 9 {{.*}} extractelement		; CHECK: cost of 7 {{.*}} extractelement

%r = extractelement <4 x float> %bin.rdx8, i32 0		%r = extractelement <4 x float> %bin.rdx8, i32 0
ret float %r		ret float %r
}		}

define fastcc i32 @reduction_cost_int(<8 x i32> %rdx) {		define fastcc i32 @reduction_cost_int(<8 x i32> %rdx) {
%rdx.shuf = shufflevector <8 x i32> %rdx, <8 x i32> undef,		%rdx.shuf = shufflevector <8 x i32> %rdx, <8 x i32> undef,
<8 x i32> <i32 4 , i32 5, i32 6, i32 7,		<8 x i32> <i32 4 , i32 5, i32 6, i32 7,
[... 25 lines not shown ...]	define fastcc float @pairwise_hadd(<4 x float> %rdx, float %f1) {
%bin.rdx.0 = fadd <4 x float> %rdx.shuf.0.0, %rdx.shuf.0.1		%bin.rdx.0 = fadd <4 x float> %rdx.shuf.0.0, %rdx.shuf.0.1
%rdx.shuf.1.0 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,		%rdx.shuf.1.0 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,
<4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>		<4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>
%rdx.shuf.1.1 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,		%rdx.shuf.1.1 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,
<4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>		<4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
%bin.rdx.1 = fadd <4 x float> %rdx.shuf.1.0, %rdx.shuf.1.1		%bin.rdx.1 = fadd <4 x float> %rdx.shuf.1.0, %rdx.shuf.1.1

; CHECK-LABEL: pairwise_hadd		; CHECK-LABEL: pairwise_hadd
; CHECK: cost of 11 {{.*}} extractelement		; CHECK: cost of 9 {{.*}} extractelement

%r = extractelement <4 x float> %bin.rdx.1, i32 0		%r = extractelement <4 x float> %bin.rdx.1, i32 0
%r2 = fadd float %r, %f1		%r2 = fadd float %r, %f1
ret float %r2		ret float %r2
}		}

define fastcc float @pairwise_hadd_assoc(<4 x float> %rdx, float %f1) {		define fastcc float @pairwise_hadd_assoc(<4 x float> %rdx, float %f1) {
%rdx.shuf.0.0 = shufflevector <4 x float> %rdx, <4 x float> undef,		%rdx.shuf.0.0 = shufflevector <4 x float> %rdx, <4 x float> undef,
<4 x i32> <i32 0, i32 2 , i32 undef, i32 undef>		<4 x i32> <i32 0, i32 2 , i32 undef, i32 undef>
%rdx.shuf.0.1 = shufflevector <4 x float> %rdx, <4 x float> undef,		%rdx.shuf.0.1 = shufflevector <4 x float> %rdx, <4 x float> undef,
<4 x i32> <i32 1, i32 3, i32 undef, i32 undef>		<4 x i32> <i32 1, i32 3, i32 undef, i32 undef>
%bin.rdx.0 = fadd <4 x float> %rdx.shuf.0.1, %rdx.shuf.0.0		%bin.rdx.0 = fadd <4 x float> %rdx.shuf.0.1, %rdx.shuf.0.0
%rdx.shuf.1.0 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,		%rdx.shuf.1.0 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,
<4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>		<4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>
%rdx.shuf.1.1 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,		%rdx.shuf.1.1 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,
<4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>		<4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
%bin.rdx.1 = fadd <4 x float> %rdx.shuf.1.0, %rdx.shuf.1.1		%bin.rdx.1 = fadd <4 x float> %rdx.shuf.1.0, %rdx.shuf.1.1

; CHECK-LABEL: pairwise_hadd_assoc		; CHECK-LABEL: pairwise_hadd_assoc
; CHECK: cost of 11 {{.*}} extractelement		; CHECK: cost of 9 {{.*}} extractelement

%r = extractelement <4 x float> %bin.rdx.1, i32 0		%r = extractelement <4 x float> %bin.rdx.1, i32 0
%r2 = fadd float %r, %f1		%r2 = fadd float %r, %f1
ret float %r2		ret float %r2
}		}

define fastcc float @pairwise_hadd_skip_first(<4 x float> %rdx, float %f1) {		define fastcc float @pairwise_hadd_skip_first(<4 x float> %rdx, float %f1) {
%rdx.shuf.0.0 = shufflevector <4 x float> %rdx, <4 x float> undef,		%rdx.shuf.0.0 = shufflevector <4 x float> %rdx, <4 x float> undef,
[... 282 lines not shown ...]

llvm/trunk/test/Transforms/LoopVectorize/X86/imprecise-through-phis.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -S -loop-vectorize -mtriple=x86_64-apple-darwin %s | FileCheck %s			; RUN: opt -S -loop-vectorize -mtriple=x86_64-apple-darwin %s | FileCheck %s

				; FIXME: The intent is that we should be able to vectorize this on x86
				; because that would be profitable, but the cost model says it is not.

	; Two mostly identical functions. The only difference is the presence of			; Two mostly identical functions. The only difference is the presence of
	; fast-math flags on the second. The loop is a pretty simple reduction:			; fast-math flags on the second. The loop is a pretty simple reduction:

	; for (int i = 0; i < 32; ++i)			; for (int i = 0; i < 32; ++i)
	; if (arr[i] != 42)			; if (arr[i] != 42.0)
	; tot += arr[i];			; tot += arr[i];

	define double @sumIfScalar(double* nocapture readonly %arr) {			define double @sumIfScalar(double* nocapture readonly %arr) {
	; CHECK-LABEL: @sumIfScalar(			; CHECK-LABEL: @sumIfScalar(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: br label [[LOOP:%.*]]			; CHECK-NEXT: br label [[LOOP:%.*]]
	; CHECK: loop:			; CHECK: loop:
	; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[I_NEXT:%.*]], [[NEXT_ITER:%.*]] ]			; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[I_NEXT:%.*]], [[NEXT_ITER:%.*]] ]
	[... 44 lines not shown ...]

	done:			done:
	ret double %tot.next			ret double %tot.next
	}			}

	define double @sumIfVector(double* nocapture readonly %arr) {			define double @sumIfVector(double* nocapture readonly %arr) {
	; CHECK-LABEL: @sumIfVector(			; CHECK-LABEL: @sumIfVector(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; CHECK: vector.ph:
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <2 x double> [ zeroinitializer, [[VECTOR_PH]] ], [ [[PREDPHI:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i32> undef, i32 [[INDEX]], i32 0
	; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i32> [[BROADCAST_SPLATINSERT]], <2 x i32> undef, <2 x i32> zeroinitializer
	; CHECK-NEXT: [[INDUCTION:%.*]] = add <2 x i32> [[BROADCAST_SPLAT]], <i32 0, i32 1>
	; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[INDEX]], 0
	; CHECK-NEXT: [[TMP1:%.*]] = getelementptr double, double* [[ARR:%.*]], i32 [[TMP0]]
	; CHECK-NEXT: [[TMP2:%.*]] = getelementptr double, double* [[TMP1]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = bitcast double* [[TMP2]] to <2 x double>*
	; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <2 x double>, <2 x double>* [[TMP3]], align 8
	; CHECK-NEXT: [[TMP4:%.*]] = fcmp fast une <2 x double> [[WIDE_LOAD]], <double 4.200000e+01, double 4.200000e+01>
	; CHECK-NEXT: [[TMP5:%.*]] = fadd fast <2 x double> [[VEC_PHI]], [[WIDE_LOAD]]
	; CHECK-NEXT: [[TMP6:%.*]] = xor <2 x i1> [[TMP4]], <i1 true, i1 true>
	; CHECK-NEXT: [[PREDPHI]] = select <2 x i1> [[TMP4]], <2 x double> [[TMP5]], <2 x double> [[VEC_PHI]]
	; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 2
	; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i32 [[INDEX_NEXT]], 32
	; CHECK-NEXT: br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !0
	; CHECK: middle.block:
	; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <2 x double> [[PREDPHI]], <2 x double> undef, <2 x i32> <i32 1, i32 undef>
	; CHECK-NEXT: [[BIN_RDX:%.*]] = fadd fast <2 x double> [[PREDPHI]], [[RDX_SHUF]]
	; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x double> [[BIN_RDX]], i32 0
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 32, 32
	; CHECK-NEXT: br i1 [[CMP_N]], label [[DONE:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ 32, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
	; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi double [ 0.000000e+00, [[ENTRY]] ], [ [[TMP8]], [[MIDDLE_BLOCK]] ]
	; CHECK-NEXT: br label [[LOOP:%.*]]			; CHECK-NEXT: br label [[LOOP:%.*]]
	; CHECK: loop:			; CHECK: loop:
	; CHECK-NEXT: [[I:%.*]] = phi i32 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[I_NEXT:%.*]], [[NEXT_ITER:%.*]] ]			; CHECK-NEXT: [[I:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[I_NEXT:%.*]], [[NEXT_ITER:%.*]] ]
	; CHECK-NEXT: [[TOT:%.*]] = phi double [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ], [ [[TOT_NEXT:%.*]], [[NEXT_ITER]] ]			; CHECK-NEXT: [[TOT:%.*]] = phi double [ 0.000000e+00, [[ENTRY]] ], [ [[TOT_NEXT:%.*]], [[NEXT_ITER]] ]
	; CHECK-NEXT: [[ADDR:%.*]] = getelementptr double, double* [[ARR]], i32 [[I]]			; CHECK-NEXT: [[ADDR:%.*]] = getelementptr double, double* [[ARR:%.*]], i32 [[I]]
	; CHECK-NEXT: [[NEXTVAL:%.*]] = load double, double* [[ADDR]]			; CHECK-NEXT: [[NEXTVAL:%.*]] = load double, double* [[ADDR]]
	; CHECK-NEXT: [[TST:%.*]] = fcmp fast une double [[NEXTVAL]], 4.200000e+01			; CHECK-NEXT: [[TST:%.*]] = fcmp fast une double [[NEXTVAL]], 4.200000e+01
	; CHECK-NEXT: br i1 [[TST]], label [[DO_ADD:%.*]], label [[NO_ADD:%.*]]			; CHECK-NEXT: br i1 [[TST]], label [[DO_ADD:%.*]], label [[NO_ADD:%.*]]
	; CHECK: do.add:			; CHECK: do.add:
	; CHECK-NEXT: [[TOT_NEW:%.*]] = fadd fast double [[TOT]], [[NEXTVAL]]			; CHECK-NEXT: [[TOT_NEW:%.*]] = fadd fast double [[TOT]], [[NEXTVAL]]
	; CHECK-NEXT: br label [[NEXT_ITER]]			; CHECK-NEXT: br label [[NEXT_ITER]]
	; CHECK: no.add:			; CHECK: no.add:
	; CHECK-NEXT: br label [[NEXT_ITER]]			; CHECK-NEXT: br label [[NEXT_ITER]]
	; CHECK: next.iter:			; CHECK: next.iter:
	; CHECK-NEXT: [[TOT_NEXT]] = phi double [ [[TOT]], [[NO_ADD]] ], [ [[TOT_NEW]], [[DO_ADD]] ]			; CHECK-NEXT: [[TOT_NEXT]] = phi double [ [[TOT]], [[NO_ADD]] ], [ [[TOT_NEW]], [[DO_ADD]] ]
	; CHECK-NEXT: [[I_NEXT]] = add i32 [[I]], 1			; CHECK-NEXT: [[I_NEXT]] = add i32 [[I]], 1
	; CHECK-NEXT: [[AGAIN:%.*]] = icmp ult i32 [[I_NEXT]], 32			; CHECK-NEXT: [[AGAIN:%.*]] = icmp ult i32 [[I_NEXT]], 32
	; CHECK-NEXT: br i1 [[AGAIN]], label [[LOOP]], label [[DONE]], !llvm.loop !2			; CHECK-NEXT: br i1 [[AGAIN]], label [[LOOP]], label [[DONE:%.*]]
	; CHECK: done:			; CHECK: done:
	; CHECK-NEXT: [[TOT_NEXT_LCSSA:%.*]] = phi double [ [[TOT_NEXT]], [[NEXT_ITER]] ], [ [[TMP8]], [[MIDDLE_BLOCK]] ]			; CHECK-NEXT: [[TOT_NEXT_LCSSA:%.*]] = phi double [ [[TOT_NEXT]], [[NEXT_ITER]] ]
	; CHECK-NEXT: ret double [[TOT_NEXT_LCSSA]]			; CHECK-NEXT: ret double [[TOT_NEXT_LCSSA]]
	;			;
	entry:			entry:
	br label %loop			br label %loop

	loop:			loop:
	%i = phi i32 [0, %entry], [%i.next, %next.iter]			%i = phi i32 [0, %entry], [%i.next, %next.iter]
	%tot = phi double [0.0, %entry], [%tot.next, %next.iter]			%tot = phi double [0.0, %entry], [%tot.next, %next.iter]
	[... 24 lines not shown ...]

llvm/trunk/test/Transforms/SLPVectorizer/AArch64/remarks.ll

	; RUN: opt -S -slp-vectorizer -mtriple=aarch64--linux-gnu -mcpu=generic -pass-remarks=slp-vectorizer -o /dev/null < %s 2>&1 | FileCheck %s			; RUN: opt -S -slp-vectorizer -mtriple=aarch64--linux-gnu -mcpu=generic -pass-remarks=slp-vectorizer -o /dev/null < %s 2>&1 | FileCheck %s

	define void @f(double* %r, double* %w) {			define void @f(double* %r, double* %w) {
	%r0 = getelementptr inbounds double, double* %r, i64 0			%r0 = getelementptr inbounds double, double* %r, i64 0
	%r1 = getelementptr inbounds double, double* %r, i64 1			%r1 = getelementptr inbounds double, double* %r, i64 1
	%f0 = load double, double* %r0			%f0 = load double, double* %r0
	%f1 = load double, double* %r1			%f1 = load double, double* %r1
	%add0 = fadd double %f0, %f0			%add0 = fadd double %f0, %f0
	%add1 = fadd double %f1, %f1			%add1 = fadd double %f1, %f1
	%w0 = getelementptr inbounds double, double* %w, i64 0			%w0 = getelementptr inbounds double, double* %w, i64 0
	%w1 = getelementptr inbounds double, double* %w, i64 1			%w1 = getelementptr inbounds double, double* %w, i64 1
	; CHECK: remark: /tmp/s.c:5:10: Stores SLP vectorized with cost -4 and with tree size 3			; CHECK: remark: /tmp/s.c:5:10: Stores SLP vectorized with cost -3 and with tree size 3
	store double %add0, double* %w0, !dbg !9			store double %add0, double* %w0, !dbg !9
	store double %add1, double* %w1			store double %add1, double* %w1
	ret void			ret void
	}			}


	!llvm.dbg.cu = !{!0}			!llvm.dbg.cu = !{!0}
	!llvm.module.flags = !{!3, !4, !5}			!llvm.module.flags = !{!3, !4, !5}
	[... 12 lines not shown ...]

llvm/trunk/test/Transforms/SLPVectorizer/X86/PR36280.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=bdver2 | FileCheck %s			; RUN: opt -slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=bdver2 | FileCheck %s

				; It is not profitable to vectorize this with <2 x float> ops.
				; This is a reduction from the Himeno benchmark.
				; https://bugs.llvm.org/show_bug.cgi?id=36280

	define float @jacobi(float* %p, float %x, float %y, float %z) {			define float @jacobi(float* %p, float %x, float %y, float %z) {
	; CHECK-LABEL: @jacobi(			; CHECK-LABEL: @jacobi(
	; CHECK-NEXT: [[GEP1:%.*]] = getelementptr float, float* [[P:%.*]], i64 1			; CHECK-NEXT: [[GEP1:%.*]] = getelementptr float, float* [[P:%.*]], i64 1
	; CHECK-NEXT: [[GEP2:%.*]] = getelementptr float, float* [[P]], i64 2			; CHECK-NEXT: [[GEP2:%.*]] = getelementptr float, float* [[P]], i64 2
	; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[GEP1]] to <2 x float>*			; CHECK-NEXT: [[P1:%.*]] = load float, float* [[GEP1]]
	; CHECK-NEXT: [[TMP2:%.*]] = load <2 x float>, <2 x float>* [[TMP1]], align 4			; CHECK-NEXT: [[P2:%.*]] = load float, float* [[GEP2]]
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x float> undef, float [[X:%.*]], i32 0			; CHECK-NEXT: [[MUL1:%.*]] = fmul float [[P1]], [[X:%.*]]
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x float> [[TMP3]], float [[Y:%.*]], i32 1			; CHECK-NEXT: [[MUL2:%.*]] = fmul float [[P2]], [[Y:%.*]]
	; CHECK-NEXT: [[TMP5:%.*]] = fmul <2 x float> [[TMP4]], [[TMP2]]			; CHECK-NEXT: [[ADD1:%.*]] = fadd float [[MUL1]], [[Z:%.*]]
	; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x float> [[TMP5]], i32 0			; CHECK-NEXT: [[ADD2:%.*]] = fadd float [[MUL2]], [[ADD1]]
	; CHECK-NEXT: [[ADD1:%.*]] = fadd float [[TMP6]], [[Z:%.*]]
	; CHECK-NEXT: [[TMP7:%.*]] = extractelement <2 x float> [[TMP5]], i32 1
	; CHECK-NEXT: [[ADD2:%.*]] = fadd float [[TMP7]], [[ADD1]]
	; CHECK-NEXT: ret float [[ADD2]]			; CHECK-NEXT: ret float [[ADD2]]
	;			;
	%gep1 = getelementptr float, float* %p, i64 1			%gep1 = getelementptr float, float* %p, i64 1
	%gep2 = getelementptr float, float* %p, i64 2			%gep2 = getelementptr float, float* %p, i64 2
	%p1 = load float, float* %gep1			%p1 = load float, float* %gep1
	%p2 = load float, float* %gep2			%p2 = load float, float* %gep2
	%mul1 = fmul float %p1, %x			%mul1 = fmul float %p1, %x
	%mul2 = fmul float %p2, %y			%mul2 = fmul float %p2, %y
	%add1 = fadd float %mul1, %z			%add1 = fadd float %mul1, %z
	%add2 = fadd float %mul2, %add1			%add2 = fadd float %mul2, %add1
	ret float %add2			ret float %add2
	}			}

llvm/trunk/test/Transforms/SLPVectorizer/X86/cse.ll

	[... 13 lines not shown ...]
	define i32 @test(double* nocapture %G) {			define i32 @test(double* nocapture %G) {
	; CHECK-LABEL: @test(			; CHECK-LABEL: @test(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds double, double* [[G:%.*]], i64 5			; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds double, double* [[G:%.*]], i64 5
	; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds double, double* [[G]], i64 6			; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds double, double* [[G]], i64 6
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[ARRAYIDX]] to <2 x double>*			; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[ARRAYIDX]] to <2 x double>*
	; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8			; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8
	; CHECK-NEXT: [[TMP2:%.*]] = fmul <2 x double> <double 4.000000e+00, double 3.000000e+00>, [[TMP1]]			; CHECK-NEXT: [[TMP2:%.*]] = fmul <2 x double> <double 4.000000e+00, double 3.000000e+00>, [[TMP1]]
				; CHECK-NEXT: [[TMP3:%.*]] = fadd <2 x double> <double 1.000000e+00, double 6.000000e+00>, [[TMP2]]
	; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds double, double* [[G]], i64 1			; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds double, double* [[G]], i64 1
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast double* [[G]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP3]], <2 x double>* [[TMP4]], align 8
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x double> [[TMP2]], i32 0
				; CHECK-NEXT: [[ADD8:%.*]] = fadd double [[TMP5]], 7.000000e+00
	; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds double, double* [[G]], i64 2			; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds double, double* [[G]], i64 2
	; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x double> [[TMP1]], i32 1			; CHECK-NEXT: store double [[ADD8]], double* [[ARRAYIDX9]], align 8
	; CHECK-NEXT: [[MUL11:%.*]] = fmul double [[TMP3]], 4.000000e+00			; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x double> [[TMP1]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x double> [[TMP2]], i32 0			; CHECK-NEXT: [[MUL11:%.*]] = fmul double [[TMP6]], 4.000000e+00
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x double> undef, double [[TMP4]], i32 0			; CHECK-NEXT: [[ADD12:%.*]] = fadd double [[MUL11]], 8.000000e+00
	; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x double> [[TMP2]], i32 1
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x double> [[TMP5]], double [[TMP6]], i32 1
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x double> [[TMP7]], double [[TMP4]], i32 2
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x double> [[TMP8]], double [[MUL11]], i32 3
	; CHECK-NEXT: [[TMP10:%.*]] = fadd <4 x double> <double 1.000000e+00, double 6.000000e+00, double 7.000000e+00, double 8.000000e+00>, [[TMP9]]
	; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds double, double* [[G]], i64 3			; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds double, double* [[G]], i64 3
	; CHECK-NEXT: [[TMP11:%.*]] = bitcast double* [[G]] to <4 x double>*			; CHECK-NEXT: store double [[ADD12]], double* [[ARRAYIDX13]], align 8
	; CHECK-NEXT: store <4 x double> [[TMP10]], <4 x double>* [[TMP11]], align 8
	; CHECK-NEXT: ret i32 undef			; CHECK-NEXT: ret i32 undef
	;			;
	entry:			entry:
	%arrayidx = getelementptr inbounds double, double* %G, i64 5			%arrayidx = getelementptr inbounds double, double* %G, i64 5
	%0 = load double, double* %arrayidx, align 8			%0 = load double, double* %arrayidx, align 8
	%mul = fmul double %0, 4.000000e+00			%mul = fmul double %0, 4.000000e+00
	%add = fadd double %mul, 1.000000e+00			%add = fadd double %mul, 1.000000e+00
	store double %add, double* %G, align 8			store double %add, double* %G, align 8
	[... 320 lines not shown ...]

llvm/trunk/test/Transforms/SLPVectorizer/X86/horizontal.ll

	[... 724 lines not shown ...]
	; CHECK-NEXT: [[CMP1495:%.*]] = icmp eq i32 [[ARG_B:%.*]], 0			; CHECK-NEXT: [[CMP1495:%.*]] = icmp eq i32 [[ARG_B:%.*]], 0
	; CHECK-NEXT: br label [[FOR_BODY:%.*]]			; CHECK-NEXT: br label [[FOR_BODY:%.*]]
	; CHECK: for.cond.cleanup:			; CHECK: for.cond.cleanup:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	; CHECK: for.body:			; CHECK: for.body:
	; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_COND_CLEANUP15:%.*]] ]			; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_COND_CLEANUP15:%.*]] ]
	; CHECK-NEXT: [[TMP0:%.*]] = shl i64 [[INDVARS_IV]], 2			; CHECK-NEXT: [[TMP0:%.*]] = shl i64 [[INDVARS_IV]], 2
	; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[ARRAY:%.*]], i64 [[TMP0]]			; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[ARRAY:%.*]], i64 [[TMP0]]
	; CHECK-NEXT: [[TMP1:%.*]] = or i64 [[TMP0]], 1			; CHECK-NEXT: [[TMP1:%.*]] = load float, float* [[ARRAYIDX]], align 4
	; CHECK-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP1]]			; CHECK-NEXT: [[TMP2:%.*]] = or i64 [[TMP0]], 1
	; CHECK-NEXT: [[TMP2:%.*]] = or i64 [[TMP0]], 2			; CHECK-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP2]]
	; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP2]]			; CHECK-NEXT: [[TMP3:%.*]] = load float, float* [[ARRAYIDX4]], align 4
	; CHECK-NEXT: [[TMP3:%.*]] = or i64 [[TMP0]], 3			; CHECK-NEXT: [[TMP4:%.*]] = or i64 [[TMP0]], 2
	; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP3]]			; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP4]]
	; CHECK-NEXT: [[TMP4:%.*]] = bitcast float* [[ARRAYIDX]] to <4 x float>*			; CHECK-NEXT: [[TMP5:%.*]] = load float, float* [[ARRAYIDX8]], align 4
	; CHECK-NEXT: [[TMP5:%.*]] = load <4 x float>, <4 x float>* [[TMP4]], align 4			; CHECK-NEXT: [[TMP6:%.*]] = or i64 [[TMP0]], 3
	; CHECK-NEXT: [[TMP6:%.*]] = extractelement <4 x float> [[TMP5]], i32 0			; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP6]]
	; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x float> [[TMP5]], i32 1			; CHECK-NEXT: [[TMP7:%.*]] = load float, float* [[ARRAYIDX12]], align 4
	; CHECK-NEXT: [[TMP8:%.*]] = extractelement <4 x float> [[TMP5]], i32 2
	; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x float> [[TMP5]], i32 3
	; CHECK-NEXT: br i1 [[CMP1495]], label [[FOR_COND_CLEANUP15]], label [[FOR_BODY16_LR_PH:%.*]]			; CHECK-NEXT: br i1 [[CMP1495]], label [[FOR_COND_CLEANUP15]], label [[FOR_BODY16_LR_PH:%.*]]
	; CHECK: for.body16.lr.ph:			; CHECK: for.body16.lr.ph:
	; CHECK-NEXT: [[ADD_PTR:%.*]] = getelementptr inbounds float, float* [[ARG_A:%.*]], i64 [[INDVARS_IV]]			; CHECK-NEXT: [[ADD_PTR:%.*]] = getelementptr inbounds float, float* [[ARG_A:%.*]], i64 [[INDVARS_IV]]
	; CHECK-NEXT: [[TMP10:%.*]] = load float, float* [[ADD_PTR]], align 4			; CHECK-NEXT: [[TMP8:%.*]] = load float, float* [[ADD_PTR]], align 4
	; CHECK-NEXT: br label [[FOR_BODY16:%.*]]			; CHECK-NEXT: br label [[FOR_BODY16:%.*]]
	; CHECK: for.cond.cleanup15:			; CHECK: for.cond.cleanup15:
	; CHECK-NEXT: [[W2_0_LCSSA:%.*]] = phi float [ [[TMP8]], [[FOR_BODY]] ], [ [[TMP21:%.*]], [[FOR_BODY16]] ]			; CHECK-NEXT: [[W2_0_LCSSA:%.*]] = phi float [ [[TMP5]], [[FOR_BODY]] ], [ [[SUB28:%.*]], [[FOR_BODY16]] ]
	; CHECK-NEXT: [[W3_0_LCSSA:%.*]] = phi float [ [[TMP9]], [[FOR_BODY]] ], [ [[TMP25:%.*]], [[FOR_BODY16]] ]			; CHECK-NEXT: [[W3_0_LCSSA:%.*]] = phi float [ [[TMP7]], [[FOR_BODY]] ], [ [[W2_096:%.*]], [[FOR_BODY16]] ]
	; CHECK-NEXT: [[W1_0_LCSSA:%.*]] = phi float [ [[TMP7]], [[FOR_BODY]] ], [ [[TMP12:%.*]], [[FOR_BODY16]] ]			; CHECK-NEXT: [[W1_0_LCSSA:%.*]] = phi float [ [[TMP3]], [[FOR_BODY]] ], [ [[W0_0100:%.*]], [[FOR_BODY16]] ]
	; CHECK-NEXT: [[W0_0_LCSSA:%.*]] = phi float [ [[TMP6]], [[FOR_BODY]] ], [ [[SUB19:%.*]], [[FOR_BODY16]] ]			; CHECK-NEXT: [[W0_0_LCSSA:%.*]] = phi float [ [[TMP1]], [[FOR_BODY]] ], [ [[SUB19:%.*]], [[FOR_BODY16]] ]
	; CHECK-NEXT: store float [[W0_0_LCSSA]], float* [[ARRAYIDX]], align 4			; CHECK-NEXT: store float [[W0_0_LCSSA]], float* [[ARRAYIDX]], align 4
	; CHECK-NEXT: store float [[W1_0_LCSSA]], float* [[ARRAYIDX4]], align 4			; CHECK-NEXT: store float [[W1_0_LCSSA]], float* [[ARRAYIDX4]], align 4
	; CHECK-NEXT: store float [[W2_0_LCSSA]], float* [[ARRAYIDX8]], align 4			; CHECK-NEXT: store float [[W2_0_LCSSA]], float* [[ARRAYIDX8]], align 4
	; CHECK-NEXT: store float [[W3_0_LCSSA]], float* [[ARRAYIDX12]], align 4			; CHECK-NEXT: store float [[W3_0_LCSSA]], float* [[ARRAYIDX12]], align 4
	; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1			; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; CHECK-NEXT: [[EXITCOND109:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 6			; CHECK-NEXT: [[EXITCOND109:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 6
	; CHECK-NEXT: br i1 [[EXITCOND109]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]			; CHECK-NEXT: br i1 [[EXITCOND109]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
	; CHECK: for.body16:			; CHECK: for.body16:
				; CHECK-NEXT: [[W0_0100]] = phi float [ [[TMP1]], [[FOR_BODY16_LR_PH]] ], [ [[SUB19]], [[FOR_BODY16]] ]
				; CHECK-NEXT: [[W1_099:%.*]] = phi float [ [[TMP3]], [[FOR_BODY16_LR_PH]] ], [ [[W0_0100]], [[FOR_BODY16]] ]
	; CHECK-NEXT: [[J_098:%.*]] = phi i32 [ 0, [[FOR_BODY16_LR_PH]] ], [ [[INC:%.*]], [[FOR_BODY16]] ]			; CHECK-NEXT: [[J_098:%.*]] = phi i32 [ 0, [[FOR_BODY16_LR_PH]] ], [ [[INC:%.*]], [[FOR_BODY16]] ]
	; CHECK-NEXT: [[TMP11:%.*]] = phi <4 x float> [ [[TMP5]], [[FOR_BODY16_LR_PH]] ], [ [[TMP26:%.*]], [[FOR_BODY16]] ]			; CHECK-NEXT: [[W3_097:%.*]] = phi float [ [[TMP7]], [[FOR_BODY16_LR_PH]] ], [ [[W2_096]], [[FOR_BODY16]] ]
	; CHECK-NEXT: [[TMP12]] = extractelement <4 x float> [[TMP11]], i32 0			; CHECK-NEXT: [[W2_096]] = phi float [ [[TMP5]], [[FOR_BODY16_LR_PH]] ], [ [[SUB28]], [[FOR_BODY16]] ]
	; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x float> [[TMP11]], i32 1			; CHECK-NEXT: [[MUL17:%.*]] = fmul fast float [[W0_0100]], 0x3FF19999A0000000
	; CHECK-NEXT: [[TMP14:%.*]] = insertelement <2 x float> undef, float [[TMP12]], i32 0			; CHECK-NEXT: [[MUL18_NEG:%.*]] = fmul fast float [[W1_099]], 0xBFF3333340000000
	; CHECK-NEXT: [[TMP15:%.*]] = insertelement <2 x float> [[TMP14]], float [[TMP13]], i32 1			; CHECK-NEXT: [[SUB92:%.*]] = fadd fast float [[MUL17]], [[MUL18_NEG]]
	; CHECK-NEXT: [[TMP16:%.*]] = fmul fast <2 x float> <float 0x3FF19999A0000000, float 0xBFF3333340000000>, [[TMP15]]			; CHECK-NEXT: [[SUB19]] = fadd fast float [[SUB92]], [[TMP8]]
	; CHECK-NEXT: [[TMP17:%.*]] = extractelement <2 x float> [[TMP16]], i32 0
	; CHECK-NEXT: [[TMP18:%.*]] = extractelement <2 x float> [[TMP16]], i32 1
	; CHECK-NEXT: [[SUB92:%.*]] = fadd fast float [[TMP17]], [[TMP18]]
	; CHECK-NEXT: [[SUB19]] = fadd fast float [[SUB92]], [[TMP10]]
	; CHECK-NEXT: [[MUL20:%.*]] = fmul fast float [[SUB19]], 0x4000CCCCC0000000			; CHECK-NEXT: [[MUL20:%.*]] = fmul fast float [[SUB19]], 0x4000CCCCC0000000
	; CHECK-NEXT: [[TMP19:%.*]] = fmul fast <4 x float> <float 0xC0019999A0000000, float 0x4002666660000000, float 0x4008CCCCC0000000, float 0xC0099999A0000000>, [[TMP11]]			; CHECK-NEXT: [[MUL21_NEG:%.*]] = fmul fast float [[W0_0100]], 0xC0019999A0000000
	; CHECK-NEXT: [[ADD2293:%.*]] = fadd fast float undef, undef			; CHECK-NEXT: [[MUL23:%.*]] = fmul fast float [[W1_099]], 0x4002666660000000
	; CHECK-NEXT: [[ADD24:%.*]] = fadd fast float [[ADD2293]], undef			; CHECK-NEXT: [[MUL25:%.*]] = fmul fast float [[W2_096]], 0x4008CCCCC0000000
	; CHECK-NEXT: [[SUB2694:%.*]] = fadd fast float [[ADD24]], undef			; CHECK-NEXT: [[MUL27_NEG:%.*]] = fmul fast float [[W3_097]], 0xC0099999A0000000
	; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <4 x float> [[TMP19]], <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>			; CHECK-NEXT: [[ADD2293:%.*]] = fadd fast float [[MUL27_NEG]], [[MUL25]]
	; CHECK-NEXT: [[BIN_RDX:%.*]] = fadd fast <4 x float> [[TMP19]], [[RDX_SHUF]]			; CHECK-NEXT: [[ADD24:%.*]] = fadd fast float [[ADD2293]], [[MUL23]]
	; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <4 x float> [[BIN_RDX]], <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[SUB2694:%.*]] = fadd fast float [[ADD24]], [[MUL21_NEG]]
	; CHECK-NEXT: [[BIN_RDX2:%.*]] = fadd fast <4 x float> [[BIN_RDX]], [[RDX_SHUF1]]			; CHECK-NEXT: [[SUB28]] = fadd fast float [[SUB2694]], [[MUL20]]
	; CHECK-NEXT: [[TMP20:%.*]] = extractelement <4 x float> [[BIN_RDX2]], i32 0
	; CHECK-NEXT: [[TMP21]] = fadd fast float [[TMP20]], [[MUL20]]
	; CHECK-NEXT: [[SUB28:%.*]] = fadd fast float [[SUB2694]], [[MUL20]]
	; CHECK-NEXT: [[INC]] = add nuw i32 [[J_098]], 1			; CHECK-NEXT: [[INC]] = add nuw i32 [[J_098]], 1
	; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[ARG_B]]			; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[ARG_B]]
	; CHECK-NEXT: [[TMP22:%.*]] = insertelement <4 x float> undef, float [[SUB19]], i32 0
	; CHECK-NEXT: [[TMP23:%.*]] = insertelement <4 x float> [[TMP22]], float [[TMP12]], i32 1
	; CHECK-NEXT: [[TMP24:%.*]] = insertelement <4 x float> [[TMP23]], float [[TMP21]], i32 2
	; CHECK-NEXT: [[TMP25]] = extractelement <4 x float> [[TMP11]], i32 2
	; CHECK-NEXT: [[TMP26]] = insertelement <4 x float> [[TMP24]], float [[TMP25]], i32 3
	; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP15]], label [[FOR_BODY16]]			; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP15]], label [[FOR_BODY16]]
	;			;
	; STORE-LABEL: @foo(			; STORE-LABEL: @foo(
	; STORE-NEXT: entry:			; STORE-NEXT: entry:
	; STORE-NEXT: [[CMP1495:%.*]] = icmp eq i32 [[ARG_B:%.*]], 0			; STORE-NEXT: [[CMP1495:%.*]] = icmp eq i32 [[ARG_B:%.*]], 0
	; STORE-NEXT: br label [[FOR_BODY:%.*]]			; STORE-NEXT: br label [[FOR_BODY:%.*]]
	; STORE: for.cond.cleanup:			; STORE: for.cond.cleanup:
	; STORE-NEXT: ret void			; STORE-NEXT: ret void
	; STORE: for.body:			; STORE: for.body:
	; STORE-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_COND_CLEANUP15:%.*]] ]			; STORE-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_COND_CLEANUP15:%.*]] ]
	; STORE-NEXT: [[TMP0:%.*]] = shl i64 [[INDVARS_IV]], 2			; STORE-NEXT: [[TMP0:%.*]] = shl i64 [[INDVARS_IV]], 2
	; STORE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[ARRAY:%.*]], i64 [[TMP0]]			; STORE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[ARRAY:%.*]], i64 [[TMP0]]
	; STORE-NEXT: [[TMP1:%.*]] = or i64 [[TMP0]], 1			; STORE-NEXT: [[TMP1:%.*]] = load float, float* [[ARRAYIDX]], align 4
	; STORE-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP1]]			; STORE-NEXT: [[TMP2:%.*]] = or i64 [[TMP0]], 1
	; STORE-NEXT: [[TMP2:%.*]] = or i64 [[TMP0]], 2			; STORE-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP2]]
	; STORE-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP2]]			; STORE-NEXT: [[TMP3:%.*]] = load float, float* [[ARRAYIDX4]], align 4
	; STORE-NEXT: [[TMP3:%.*]] = or i64 [[TMP0]], 3			; STORE-NEXT: [[TMP4:%.*]] = or i64 [[TMP0]], 2
	; STORE-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP3]]			; STORE-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP4]]
	; STORE-NEXT: [[TMP4:%.*]] = bitcast float* [[ARRAYIDX]] to <4 x float>*			; STORE-NEXT: [[TMP5:%.*]] = load float, float* [[ARRAYIDX8]], align 4
	; STORE-NEXT: [[TMP5:%.*]] = load <4 x float>, <4 x float>* [[TMP4]], align 4			; STORE-NEXT: [[TMP6:%.*]] = or i64 [[TMP0]], 3
	; STORE-NEXT: [[TMP6:%.*]] = extractelement <4 x float> [[TMP5]], i32 0			; STORE-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds float, float* [[ARRAY]], i64 [[TMP6]]
	; STORE-NEXT: [[TMP7:%.*]] = extractelement <4 x float> [[TMP5]], i32 1			; STORE-NEXT: [[TMP7:%.*]] = load float, float* [[ARRAYIDX12]], align 4
	; STORE-NEXT: [[TMP8:%.*]] = extractelement <4 x float> [[TMP5]], i32 2
	; STORE-NEXT: [[TMP9:%.*]] = extractelement <4 x float> [[TMP5]], i32 3
	; STORE-NEXT: br i1 [[CMP1495]], label [[FOR_COND_CLEANUP15]], label [[FOR_BODY16_LR_PH:%.*]]			; STORE-NEXT: br i1 [[CMP1495]], label [[FOR_COND_CLEANUP15]], label [[FOR_BODY16_LR_PH:%.*]]
	; STORE: for.body16.lr.ph:			; STORE: for.body16.lr.ph:
	; STORE-NEXT: [[ADD_PTR:%.*]] = getelementptr inbounds float, float* [[ARG_A:%.*]], i64 [[INDVARS_IV]]			; STORE-NEXT: [[ADD_PTR:%.*]] = getelementptr inbounds float, float* [[ARG_A:%.*]], i64 [[INDVARS_IV]]
	; STORE-NEXT: [[TMP10:%.*]] = load float, float* [[ADD_PTR]], align 4			; STORE-NEXT: [[TMP8:%.*]] = load float, float* [[ADD_PTR]], align 4
	; STORE-NEXT: br label [[FOR_BODY16:%.*]]			; STORE-NEXT: br label [[FOR_BODY16:%.*]]
	; STORE: for.cond.cleanup15:			; STORE: for.cond.cleanup15:
	; STORE-NEXT: [[W2_0_LCSSA:%.*]] = phi float [ [[TMP8]], [[FOR_BODY]] ], [ [[TMP21:%.*]], [[FOR_BODY16]] ]			; STORE-NEXT: [[W2_0_LCSSA:%.*]] = phi float [ [[TMP5]], [[FOR_BODY]] ], [ [[SUB28:%.*]], [[FOR_BODY16]] ]
	; STORE-NEXT: [[W3_0_LCSSA:%.*]] = phi float [ [[TMP9]], [[FOR_BODY]] ], [ [[TMP25:%.*]], [[FOR_BODY16]] ]			; STORE-NEXT: [[W3_0_LCSSA:%.*]] = phi float [ [[TMP7]], [[FOR_BODY]] ], [ [[W2_096:%.*]], [[FOR_BODY16]] ]
	; STORE-NEXT: [[W1_0_LCSSA:%.*]] = phi float [ [[TMP7]], [[FOR_BODY]] ], [ [[TMP12:%.*]], [[FOR_BODY16]] ]			; STORE-NEXT: [[W1_0_LCSSA:%.*]] = phi float [ [[TMP3]], [[FOR_BODY]] ], [ [[W0_0100:%.*]], [[FOR_BODY16]] ]
	; STORE-NEXT: [[W0_0_LCSSA:%.*]] = phi float [ [[TMP6]], [[FOR_BODY]] ], [ [[SUB19:%.*]], [[FOR_BODY16]] ]			; STORE-NEXT: [[W0_0_LCSSA:%.*]] = phi float [ [[TMP1]], [[FOR_BODY]] ], [ [[SUB19:%.*]], [[FOR_BODY16]] ]
	; STORE-NEXT: store float [[W0_0_LCSSA]], float* [[ARRAYIDX]], align 4			; STORE-NEXT: store float [[W0_0_LCSSA]], float* [[ARRAYIDX]], align 4
	; STORE-NEXT: store float [[W1_0_LCSSA]], float* [[ARRAYIDX4]], align 4			; STORE-NEXT: store float [[W1_0_LCSSA]], float* [[ARRAYIDX4]], align 4
	; STORE-NEXT: store float [[W2_0_LCSSA]], float* [[ARRAYIDX8]], align 4			; STORE-NEXT: store float [[W2_0_LCSSA]], float* [[ARRAYIDX8]], align 4
	; STORE-NEXT: store float [[W3_0_LCSSA]], float* [[ARRAYIDX12]], align 4			; STORE-NEXT: store float [[W3_0_LCSSA]], float* [[ARRAYIDX12]], align 4
	; STORE-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1			; STORE-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; STORE-NEXT: [[EXITCOND109:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 6			; STORE-NEXT: [[EXITCOND109:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 6
	; STORE-NEXT: br i1 [[EXITCOND109]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]			; STORE-NEXT: br i1 [[EXITCOND109]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
	; STORE: for.body16:			; STORE: for.body16:
				; STORE-NEXT: [[W0_0100]] = phi float [ [[TMP1]], [[FOR_BODY16_LR_PH]] ], [ [[SUB19]], [[FOR_BODY16]] ]
				; STORE-NEXT: [[W1_099:%.*]] = phi float [ [[TMP3]], [[FOR_BODY16_LR_PH]] ], [ [[W0_0100]], [[FOR_BODY16]] ]
	; STORE-NEXT: [[J_098:%.*]] = phi i32 [ 0, [[FOR_BODY16_LR_PH]] ], [ [[INC:%.*]], [[FOR_BODY16]] ]			; STORE-NEXT: [[J_098:%.*]] = phi i32 [ 0, [[FOR_BODY16_LR_PH]] ], [ [[INC:%.*]], [[FOR_BODY16]] ]
	; STORE-NEXT: [[TMP11:%.*]] = phi <4 x float> [ [[TMP5]], [[FOR_BODY16_LR_PH]] ], [ [[TMP26:%.*]], [[FOR_BODY16]] ]			; STORE-NEXT: [[W3_097:%.*]] = phi float [ [[TMP7]], [[FOR_BODY16_LR_PH]] ], [ [[W2_096]], [[FOR_BODY16]] ]
	; STORE-NEXT: [[TMP12]] = extractelement <4 x float> [[TMP11]], i32 0			; STORE-NEXT: [[W2_096]] = phi float [ [[TMP5]], [[FOR_BODY16_LR_PH]] ], [ [[SUB28]], [[FOR_BODY16]] ]
	; STORE-NEXT: [[TMP13:%.*]] = extractelement <4 x float> [[TMP11]], i32 1			; STORE-NEXT: [[MUL17:%.*]] = fmul fast float [[W0_0100]], 0x3FF19999A0000000
	; STORE-NEXT: [[TMP14:%.*]] = insertelement <2 x float> undef, float [[TMP12]], i32 0			; STORE-NEXT: [[MUL18_NEG:%.*]] = fmul fast float [[W1_099]], 0xBFF3333340000000
	; STORE-NEXT: [[TMP15:%.*]] = insertelement <2 x float> [[TMP14]], float [[TMP13]], i32 1			; STORE-NEXT: [[SUB92:%.*]] = fadd fast float [[MUL17]], [[MUL18_NEG]]
	; STORE-NEXT: [[TMP16:%.*]] = fmul fast <2 x float> <float 0x3FF19999A0000000, float 0xBFF3333340000000>, [[TMP15]]			; STORE-NEXT: [[SUB19]] = fadd fast float [[SUB92]], [[TMP8]]
	; STORE-NEXT: [[TMP17:%.*]] = extractelement <2 x float> [[TMP16]], i32 0
	; STORE-NEXT: [[TMP18:%.*]] = extractelement <2 x float> [[TMP16]], i32 1
	; STORE-NEXT: [[SUB92:%.*]] = fadd fast float [[TMP17]], [[TMP18]]
	; STORE-NEXT: [[SUB19]] = fadd fast float [[SUB92]], [[TMP10]]
	; STORE-NEXT: [[MUL20:%.*]] = fmul fast float [[SUB19]], 0x4000CCCCC0000000			; STORE-NEXT: [[MUL20:%.*]] = fmul fast float [[SUB19]], 0x4000CCCCC0000000
	; STORE-NEXT: [[TMP19:%.*]] = fmul fast <4 x float> <float 0xC0019999A0000000, float 0x4002666660000000, float 0x4008CCCCC0000000, float 0xC0099999A0000000>, [[TMP11]]			; STORE-NEXT: [[MUL21_NEG:%.*]] = fmul fast float [[W0_0100]], 0xC0019999A0000000
	; STORE-NEXT: [[ADD2293:%.*]] = fadd fast float undef, undef			; STORE-NEXT: [[MUL23:%.*]] = fmul fast float [[W1_099]], 0x4002666660000000
	; STORE-NEXT: [[ADD24:%.*]] = fadd fast float [[ADD2293]], undef			; STORE-NEXT: [[MUL25:%.*]] = fmul fast float [[W2_096]], 0x4008CCCCC0000000
	; STORE-NEXT: [[SUB2694:%.*]] = fadd fast float [[ADD24]], undef			; STORE-NEXT: [[MUL27_NEG:%.*]] = fmul fast float [[W3_097]], 0xC0099999A0000000
	; STORE-NEXT: [[RDX_SHUF:%.*]] = shufflevector <4 x float> [[TMP19]], <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>			; STORE-NEXT: [[ADD2293:%.*]] = fadd fast float [[MUL27_NEG]], [[MUL25]]
	; STORE-NEXT: [[BIN_RDX:%.*]] = fadd fast <4 x float> [[TMP19]], [[RDX_SHUF]]			; STORE-NEXT: [[ADD24:%.*]] = fadd fast float [[ADD2293]], [[MUL23]]
	; STORE-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <4 x float> [[BIN_RDX]], <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>			; STORE-NEXT: [[SUB2694:%.*]] = fadd fast float [[ADD24]], [[MUL21_NEG]]
	; STORE-NEXT: [[BIN_RDX2:%.*]] = fadd fast <4 x float> [[BIN_RDX]], [[RDX_SHUF1]]			; STORE-NEXT: [[SUB28]] = fadd fast float [[SUB2694]], [[MUL20]]
	; STORE-NEXT: [[TMP20:%.*]] = extractelement <4 x float> [[BIN_RDX2]], i32 0
	; STORE-NEXT: [[TMP21]] = fadd fast float [[TMP20]], [[MUL20]]
	; STORE-NEXT: [[SUB28:%.*]] = fadd fast float [[SUB2694]], [[MUL20]]
	; STORE-NEXT: [[INC]] = add nuw i32 [[J_098]], 1			; STORE-NEXT: [[INC]] = add nuw i32 [[J_098]], 1
	; STORE-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[ARG_B]]			; STORE-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[ARG_B]]
	; STORE-NEXT: [[TMP22:%.*]] = insertelement <4 x float> undef, float [[SUB19]], i32 0
	; STORE-NEXT: [[TMP23:%.*]] = insertelement <4 x float> [[TMP22]], float [[TMP12]], i32 1
	; STORE-NEXT: [[TMP24:%.*]] = insertelement <4 x float> [[TMP23]], float [[TMP21]], i32 2
	; STORE-NEXT: [[TMP25]] = extractelement <4 x float> [[TMP11]], i32 2
	; STORE-NEXT: [[TMP26]] = insertelement <4 x float> [[TMP24]], float [[TMP25]], i32 3
	; STORE-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP15]], label [[FOR_BODY16]]			; STORE-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP15]], label [[FOR_BODY16]]
	;			;
	entry:			entry:
	%cmp1495 = icmp eq i32 %arg_B, 0			%cmp1495 = icmp eq i32 %arg_B, 0
	br label %for.body			br label %for.body

	for.cond.cleanup: ; preds = %for.cond.cleanup15			for.cond.cleanup: ; preds = %for.cond.cleanup15
	ret void			ret void

llvm/trunk/test/Transforms/SLPVectorizer/X86/reorder_phi.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -basicaa -slp-vectorizer -S -mtriple=x86_64-unknown -mcpu=corei7-avx | FileCheck %s			; RUN: opt < %s -basicaa -slp-vectorizer -S -mtriple=x86_64-unknown -mcpu=corei7-avx | FileCheck %s

	%struct.complex = type { float, float }			%struct.complex = type { float, float }

	define void @foo (%struct.complex* %A, %struct.complex* %B, %struct.complex* %Result) {			define void @foo (%struct.complex* %A, %struct.complex* %B, %struct.complex* %Result) {
	; CHECK-LABEL: @foo(			; CHECK-LABEL: @foo(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = add i64 256, 0			; CHECK-NEXT: [[TMP0:%.*]] = add i64 256, 0
	; CHECK-NEXT: br label [[LOOP:%.*]]			; CHECK-NEXT: br label [[LOOP:%.*]]
	; CHECK: loop:			; CHECK: loop:
	; CHECK-NEXT: [[TMP1:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[TMP25:%.*]], [[LOOP]] ]			; CHECK-NEXT: [[TMP1:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[TMP20:%.*]], [[LOOP]] ]
	; CHECK-NEXT: [[TMP2:%.*]] = phi <2 x float> [ zeroinitializer, [[ENTRY]] ], [ [[TMP24:%.*]], [[LOOP]] ]			; CHECK-NEXT: [[TMP2:%.*]] = phi float [ 0.000000e+00, [[ENTRY]] ], [ [[TMP19:%.*]], [[LOOP]] ]
	; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX:%.*]], %struct.complex* [[A:%.*]], i64 [[TMP1]], i32 0			; CHECK-NEXT: [[TMP3:%.*]] = phi float [ 0.000000e+00, [[ENTRY]] ], [ [[TMP18:%.*]], [[LOOP]] ]
	; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[A]], i64 [[TMP1]], i32 1			; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX:%.*]], %struct.complex* [[A:%.*]], i64 [[TMP1]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = bitcast float* [[TMP3]] to <2 x float>*			; CHECK-NEXT: [[TMP5:%.*]] = load float, float* [[TMP4]], align 4
	; CHECK-NEXT: [[TMP6:%.*]] = load <2 x float>, <2 x float>* [[TMP5]], align 4			; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[A]], i64 [[TMP1]], i32 1
	; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[B:%.*]], i64 [[TMP1]], i32 0			; CHECK-NEXT: [[TMP7:%.*]] = load float, float* [[TMP6]], align 4
	; CHECK-NEXT: [[TMP8:%.*]] = load float, float* [[TMP7]], align 4			; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[B:%.*]], i64 [[TMP1]], i32 0
	; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[B]], i64 [[TMP1]], i32 1			; CHECK-NEXT: [[TMP9:%.*]] = load float, float* [[TMP8]], align 4
	; CHECK-NEXT: [[TMP10:%.*]] = load float, float* [[TMP9]], align 4			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[B]], i64 [[TMP1]], i32 1
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <2 x float> undef, float [[TMP8]], i32 0			; CHECK-NEXT: [[TMP11:%.*]] = load float, float* [[TMP10]], align 4
	; CHECK-NEXT: [[TMP12:%.*]] = insertelement <2 x float> [[TMP11]], float [[TMP8]], i32 1			; CHECK-NEXT: [[TMP12:%.*]] = fmul float [[TMP5]], [[TMP9]]
	; CHECK-NEXT: [[TMP13:%.*]] = fmul <2 x float> [[TMP6]], [[TMP12]]			; CHECK-NEXT: [[TMP13:%.*]] = fmul float [[TMP7]], [[TMP11]]
	; CHECK-NEXT: [[TMP14:%.*]] = extractelement <2 x float> [[TMP6]], i32 1			; CHECK-NEXT: [[TMP14:%.*]] = fsub float [[TMP12]], [[TMP13]]
	; CHECK-NEXT: [[TMP15:%.*]] = insertelement <2 x float> undef, float [[TMP14]], i32 0			; CHECK-NEXT: [[TMP15:%.*]] = fmul float [[TMP7]], [[TMP9]]
	; CHECK-NEXT: [[TMP16:%.*]] = extractelement <2 x float> [[TMP6]], i32 0			; CHECK-NEXT: [[TMP16:%.*]] = fmul float [[TMP5]], [[TMP11]]
	; CHECK-NEXT: [[TMP17:%.*]] = insertelement <2 x float> [[TMP15]], float [[TMP16]], i32 1			; CHECK-NEXT: [[TMP17:%.*]] = fadd float [[TMP15]], [[TMP16]]
	; CHECK-NEXT: [[TMP18:%.*]] = insertelement <2 x float> undef, float [[TMP10]], i32 0			; CHECK-NEXT: [[TMP18]] = fadd float [[TMP3]], [[TMP14]]
	; CHECK-NEXT: [[TMP19:%.*]] = insertelement <2 x float> [[TMP18]], float [[TMP10]], i32 1			; CHECK-NEXT: [[TMP19]] = fadd float [[TMP2]], [[TMP17]]
	; CHECK-NEXT: [[TMP20:%.*]] = fmul <2 x float> [[TMP17]], [[TMP19]]			; CHECK-NEXT: [[TMP20]] = add nuw nsw i64 [[TMP1]], 1
	; CHECK-NEXT: [[TMP21:%.*]] = fsub <2 x float> [[TMP13]], [[TMP20]]			; CHECK-NEXT: [[TMP21:%.*]] = icmp eq i64 [[TMP20]], [[TMP0]]
	; CHECK-NEXT: [[TMP22:%.*]] = fadd <2 x float> [[TMP13]], [[TMP20]]			; CHECK-NEXT: br i1 [[TMP21]], label [[EXIT:%.*]], label [[LOOP]]
	; CHECK-NEXT: [[TMP23:%.*]] = shufflevector <2 x float> [[TMP21]], <2 x float> [[TMP22]], <2 x i32> <i32 0, i32 3>
	; CHECK-NEXT: [[TMP24]] = fadd <2 x float> [[TMP2]], [[TMP23]]
	; CHECK-NEXT: [[TMP25]] = add nuw nsw i64 [[TMP1]], 1
	; CHECK-NEXT: [[TMP26:%.*]] = icmp eq i64 [[TMP25]], [[TMP0]]
	; CHECK-NEXT: br i1 [[TMP26]], label [[EXIT:%.*]], label [[LOOP]]
	; CHECK: exit:			; CHECK: exit:
	; CHECK-NEXT: [[TMP27:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[RESULT:%.*]], i32 0, i32 0			; CHECK-NEXT: [[TMP22:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[RESULT:%.*]], i32 0, i32 0
	; CHECK-NEXT: [[TMP28:%.*]] = extractelement <2 x float> [[TMP24]], i32 0			; CHECK-NEXT: store float [[TMP18]], float* [[TMP22]], align 4
	; CHECK-NEXT: store float [[TMP28]], float* [[TMP27]], align 4			; CHECK-NEXT: [[TMP23:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[RESULT]], i32 0, i32 1
	; CHECK-NEXT: [[TMP29:%.*]] = getelementptr inbounds [[STRUCT_COMPLEX]], %struct.complex* [[RESULT]], i32 0, i32 1			; CHECK-NEXT: store float [[TMP19]], float* [[TMP23]], align 4
	; CHECK-NEXT: [[TMP30:%.*]] = extractelement <2 x float> [[TMP24]], i32 1
	; CHECK-NEXT: store float [[TMP30]], float* [[TMP29]], align 4
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%0 = add i64 256, 0			%0 = add i64 256, 0
	br label %loop			br label %loop

	loop:			loop:
	%1 = phi i64 [ 0, %entry ], [ %20, %loop ]			%1 = phi i64 [ 0, %entry ], [ %20, %loop ]

llvm/trunk/test/Transforms/SLPVectorizer/X86/simplebb.ll

Show First 20 Lines • Show All 58 Lines • ▼ Show 20 Lines	;
ret void		ret void
}		}

; Don't vectorize volatile loads.		; Don't vectorize volatile loads.
define void @test_volatile_load(double* %a, double* %b, double* %c) {		define void @test_volatile_load(double* %a, double* %b, double* %c) {
; CHECK-LABEL: @test_volatile_load(		; CHECK-LABEL: @test_volatile_load(
; CHECK-NEXT: [[I0:%.*]] = load volatile double, double* [[A:%.*]], align 8		; CHECK-NEXT: [[I0:%.*]] = load volatile double, double* [[A:%.*]], align 8
; CHECK-NEXT: [[I1:%.*]] = load volatile double, double* [[B:%.*]], align 8		; CHECK-NEXT: [[I1:%.*]] = load volatile double, double* [[B:%.*]], align 8
		; CHECK-NEXT: [[MUL:%.*]] = fmul double [[I0]], [[I1]]
; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds double, double* [[A]], i64 1		; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds double, double* [[A]], i64 1
; CHECK-NEXT: [[I3:%.*]] = load double, double* [[ARRAYIDX3]], align 8		; CHECK-NEXT: [[I3:%.*]] = load double, double* [[ARRAYIDX3]], align 8
; CHECK-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds double, double* [[B]], i64 1		; CHECK-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds double, double* [[B]], i64 1
; CHECK-NEXT: [[I4:%.*]] = load double, double* [[ARRAYIDX4]], align 8		; CHECK-NEXT: [[I4:%.*]] = load double, double* [[ARRAYIDX4]], align 8
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> undef, double [[I0]], i32 0		; CHECK-NEXT: [[MUL5:%.*]] = fmul double [[I3]], [[I4]]
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> [[TMP1]], double [[I3]], i32 1		; CHECK-NEXT: store double [[MUL]], double* [[C:%.*]], align 8
; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> undef, double [[I1]], i32 0		; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds double, double* [[C]], i64 1
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> [[TMP3]], double [[I4]], i32 1		; CHECK-NEXT: store double [[MUL5]], double* [[ARRAYIDX5]], align 8
; CHECK-NEXT: [[TMP5:%.*]] = fmul <2 x double> [[TMP2]], [[TMP4]]
; CHECK-NEXT: [[TMP6:%.*]] = bitcast double* [[C:%.*]] to <2 x double>*
; CHECK-NEXT: store <2 x double> [[TMP5]], <2 x double>* [[TMP6]], align 8
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%i0 = load volatile double, double* %a, align 8		%i0 = load volatile double, double* %a, align 8
%i1 = load volatile double, double* %b, align 8		%i1 = load volatile double, double* %b, align 8
%mul = fmul double %i0, %i1		%mul = fmul double %i0, %i1
%arrayidx3 = getelementptr inbounds double, double* %a, i64 1		%arrayidx3 = getelementptr inbounds double, double* %a, i64 1
%i3 = load double, double* %arrayidx3, align 8		%i3 = load double, double* %arrayidx3, align 8
%arrayidx4 = getelementptr inbounds double, double* %b, i64 1		%arrayidx4 = getelementptr inbounds double, double* %b, i64 1

Revision Contents

Diff 134925

llvm/trunk/include/llvm/CodeGen/BasicTTIImpl.h

llvm/trunk/test/Analysis/CostModel/X86/arith-fp.ll

llvm/trunk/test/Analysis/CostModel/X86/intrinsic-cost.ll

llvm/trunk/test/Analysis/CostModel/X86/reduction.ll

llvm/trunk/test/Transforms/LoopVectorize/X86/imprecise-through-phis.ll

llvm/trunk/test/Transforms/SLPVectorizer/AArch64/remarks.ll

llvm/trunk/test/Transforms/SLPVectorizer/X86/PR36280.ll

llvm/trunk/test/Transforms/SLPVectorizer/X86/cse.ll

llvm/trunk/test/Transforms/SLPVectorizer/X86/horizontal.ll

llvm/trunk/test/Transforms/SLPVectorizer/X86/reorder_phi.ll

llvm/trunk/test/Transforms/SLPVectorizer/X86/simplebb.ll
