There is a bug describing the poor cost model for floating point operations: Bug 29083 - [X86][SSE] Improve costs for floating point operations. This patch is the second in a series of patches dealing with the cost model.
Diff Detail
Repository: rL LLVM

Event Timeline
lib/Target/X86/X86TargetTransformInfo.cpp

| Line | Diff | Comment |
|---|---|---|
| 273 ↗ | (On Diff #75284) | AVXCustomCostTable was recently added - you can probably merge these into that table and avoid the extra lookup. |
| 419 ↗ | (On Diff #75284) | Put these above the SDIV/UDIV costs so it doesn't look like they are under the same 'avoid vectorize division' comment, which is purely an integer division thing. |
| 474 ↗ | (On Diff #75284) | if (ST->hasSSE1()) |
| 1087 ↗ | (On Diff #75284) | Move SSE1FloatCostTable below SSE2CostTbl to keep to the 'descending subtarget' convention (see the lookup sketch after this table). You can probably rename it SSE1CostTbl as well. |
| 1088 ↗ | (On Diff #75284) | Worth adding an SSE41CostTbl for Core2-era costs? |
| 1175 ↗ | (On Diff #75284) | if (ST->hasSSE1()) |
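For reference, the comments above concern the cost-table lookup convention in X86TTIImpl::getArithmeticInstrCost: one static CostTblEntry array per subtarget level, queried in descending ISA order. A minimal sketch of that pattern, assuming LLVM's CostTableLookup helper; the table names and cost values here are illustrative, not the exact entries from this patch:

```cpp
// One table per subtarget level; the most capable ISA is checked first so
// its entries take priority ('descending subtarget' convention).
static const CostTblEntry AVX1CostTbl[] = {
  { ISD::FDIV, MVT::v4f32, 20 }, // illustrative value
  { ISD::FDIV, MVT::v8f32, 28 }, // illustrative value
};
static const CostTblEntry SSE1CostTbl[] = {
  { ISD::FDIV, MVT::v4f32, 34 }, // placeholder value for the sketch
};

// ...inside X86TTIImpl::getArithmeticInstrCost, once ISD and the legalized
// type LT have been computed:
if (ST->hasAVX())
  if (const auto *Entry = CostTableLookup(AVX1CostTbl, ISD, LT.second))
    return LT.first * Entry->Cost;

if (ST->hasSSE1())
  if (const auto *Entry = CostTableLookup(SSE1CostTbl, ISD, LT.second))
    return LT.first * Entry->Cost;
```

Each additional table is one more lookup on the fallthrough path, which is why merging entries into an existing table avoids work. Note also that LT.first is the type-legalization split factor, so on a subtarget where a wide vector type is not legal (e.g. v8f32 without AVX) the cost is automatically scaled by the number of legal-width pieces.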
lib/Target/X86/X86TargetTransformInfo.cpp

| Line | Diff | Comment |
|---|---|---|
| 269 ↗ | (On Diff #75284) | A YMM fdiv being 3 times as expensive as an XMM fdiv seems slightly odd. I'd expect "2, maybe a bit more", and Agner seems to agree - e.g. for Sandybridge he gives a range of 10-14 cycles for XMM DIVPS and 20-28 cycles for YMM VDIVPS. Were your IACA checks accounting for additional instructions? Or is this an inconsistency between IACA and Agner's tests? (Note that these numbers are supposed to represent reciprocal throughput, but Agner's latency data also shows a factor of ~2.) |
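To make the ratio concern concrete (using the IACA block throughputs quoted in the reply below together with Agner's upper-bound figures; a rough comparison, not a new measurement):

$$\frac{\text{IACA YMM}}{\text{IACA XMM}} = \frac{41}{14} \approx 2.9 \qquad\text{vs.}\qquad \frac{\text{Agner YMM}}{\text{Agner XMM}} = \frac{28}{14} = 2$$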
lib/Target/X86/X86TargetTransformInfo.cpp

269 ↗ (On Diff #75284): I used the following numbers.

For the xmm case:

    Throughput Analysis Report
    Block Throughput: 14.00 Cycles    Throughput Bottleneck: Divider

    Port Binding In Cycles Per Iteration:
    | Port   | 0 - DV    | 1   | 2 - D   | 3 - D   | 4   | 5   |
    | Cycles | 1.0  14.0 | 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 0.0 |

    N - port number or number of cycles resource conflict caused delay
    DV - Divider pipe (on port 0)
    ^ - Micro Fusion happened
    # - ESP Tracking sync uop was issued
    @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected

    Total Num Of Uops: 1

For the ymm case:

    Throughput Analysis Report
    Block Throughput: 41.00 Cycles    Throughput Bottleneck: InterIteration

    Port Binding In Cycles Per Iteration:
    | Port   | 0 - DV    | 1   | 2 - D   | 3 - D   | 4   | 5   |
    | Cycles | 2.0  28.0 | 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 1.0 |

    N - port number or number of cycles resource conflict caused delay
    DV - Divider pipe (on port 0)
    ^ - Micro Fusion happened
    # - ESP Tracking sync uop was issued
    @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected

    Total Num Of Uops: 3

If we use the DV value then the ratio is about 2. But I used the "Block Throughput" value above. As you can see, my block is exactly one instruction, which is why I decided to use "Block Throughput". Maybe I'm wrong? Can anyone suggest the right decision?
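For context on "my block is exactly one instruction": the reports above come from measuring a single (v)divps between IACA's analysis markers. A minimal sketch of such a kernel, assuming the iacaMarks.h header shipped with IACA (the function names here are made up for illustration, and the compiler may add moves inside the marked region):

```cpp
#include <immintrin.h>
#include "iacaMarks.h" // shipped with IACA; defines the IACA_START / IACA_END markers

// xmm kernel: the analyzed block should contain exactly one 128-bit divide,
// corresponding to the 14-cycle report above.
__m128 div_xmm(__m128 a, __m128 b) {
  IACA_START
  __m128 r = _mm_div_ps(a, b);
  IACA_END
  return r;
}

// ymm kernel: one 256-bit divide (compile with -mavx), corresponding to the
// 41-cycle report above.
__m256 div_ymm(__m256 a, __m256 b) {
  IACA_START
  __m256 r = _mm256_div_ps(a, b);
  IACA_END
  return r;
}
```

With a kernel like this, "Block Throughput" covers everything between the markers, while the per-port DV cycles isolate the divider itself, which is where the factor-of-2 vs factor-of-3 difference in this discussion comes from.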
I changed everything except the issue with the ymm FDIV. As I wrote in my answer to the comment, I'm using Block Throughput from the IACA tool. If I'm wrong, please tell me and I'll re-collect all the numbers.
I'm leaning toward using Agner's numbers for SandyBridge FDIV / FSQRT costs - would anyone have any objections?
I'd really like to understand the discrepancy between Agner's numbers and the IACA numbers - especially since IACA does give the expected number for the latency.
Andrew, any chance you could ping the IACA people at Intel and ask them about this?
Meanwhile, I think I'd also prefer this to go in with Agner's numbers - and change it to the IACA numbers later if it turns out it's accounting for something real.
IACA doesn't have that great support these days - the website says development has been suspended, although they have announced they will do a BDW/SKL bugfix release at some point. Similar comments about dodgy costs don't seem to be answered.
To the best of my knowledge, Andrew works for Intel, so he may have a better chance of getting an answer. :-)
As to support - the website has a comment from July that states that "[they] are resuming support for Intel(R) Architecture Code Analyzer with BDW and SKL support probably before end of 2016".
So, one can hope...
If I understood correctly I should replace all IACA numbers with Agner's numbers, right? OK, I'll do it.
JFYI, I haven't been working at Intel since July, but of course I know a lot of people at Intel and I'll try to ask them about IACA's future.
lib/Target/X86/X86TargetTransformInfo.cpp

269 ↗ (On Diff #75284): Maybe I've understood the difference between Agner's and IACA numbers (for SandyBridge at least): Agner's table has numbers for ymm only, but when I played with IACA it showed different operand types: xmm0, ... and ymm0, ....

269 ↗ (On Diff #75284): And another comment: for f32/v4f32 IACA shows 14 for vdivss/vdivps, but Agner's table shows the same 10-14 for divss/divps and 20-28 for vdivss/vdivps. Clang generates vdivss/vdivps (for the SNB target), so I should select 28 as the cost value if we're using Agner's numbers. Is that OK?

269 ↗ (On Diff #75284): As an intermediate decision I did the following for the SNB numbers: xmm operands use the first number as the cost while ymm operands use the second number. For example:

    { ISD::FDIV, MVT::f32,   20 }, // SNB from http://www.agner.org/ (IACA: 14)
    { ISD::FDIV, MVT::v4f32, 20 }, // SNB from http://www.agner.org/ (IACA: 14)
    { ISD::FDIV, MVT::v8f32, 28 }, // SNB from http://www.agner.org/ (IACA: 41)

Is it acceptable? In fact we need some extension of our current cost tables here: it is not enough to know the instruction itself and the type of its operands. We need some target-dependent info as well, e.g. if we're speaking about X86 targets only (exactly our case) we could add the size of the target registers or something similar, such as the ISA: SNB/HSW can use both divps and vdivps, and using two divps could be better than using one vdivps, but our current cost model cannot help decide which is better.

269 ↗ (On Diff #75284): BTW, I've just realized that the Haswell IACA numbers are really close to the expected ones: f32 = 13 for vdivss. It means we have problems with the IACA numbers for SNB only (too big a difference between xmm and ymm). Maybe newer CPUs simply resolved this issue?

1088 ↗ (On Diff #75284): JFYI, I got the same numbers for Nehalem with IACA.
I think this is almost done now - please can you add the AVX2/Haswell costs (a possible table layout is sketched after this comment):
Haswell
FDIV f32/4f32 = 7, 8f32 = 14, f64/2f64 = 14, 4f64 = 28
FSQRT f32/4f32 = 7, 8f32 = 14, f64/2f64 = 14, 4f64 = 28
Other than that does anyone have any additional feedback?
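If it helps, here is roughly how those numbers could be laid out following the existing table convention. This is only a sketch built from the costs listed above; the table name AVX2CostTbl and the exact grouping are assumptions, not taken from the final patch:

```cpp
// Possible AVX2/Haswell entries built from the FDIV/FSQRT costs listed above.
static const CostTblEntry AVX2CostTbl[] = {
  { ISD::FDIV,  MVT::f32,    7 },
  { ISD::FDIV,  MVT::v4f32,  7 },
  { ISD::FDIV,  MVT::v8f32, 14 },
  { ISD::FDIV,  MVT::f64,   14 },
  { ISD::FDIV,  MVT::v2f64, 14 },
  { ISD::FDIV,  MVT::v4f64, 28 },
  { ISD::FSQRT, MVT::f32,    7 },
  { ISD::FSQRT, MVT::v4f32,  7 },
  { ISD::FSQRT, MVT::v8f32, 14 },
  { ISD::FSQRT, MVT::f64,   14 },
  { ISD::FSQRT, MVT::v2f64, 14 },
  { ISD::FSQRT, MVT::v4f64, 28 },
};
```

These would be checked under ST->hasAVX2() ahead of the AVX1/SSE tables, following the descending-subtarget order discussed earlier.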
FSQRT changes: SSE1 Cost table updated with Pentium III numbers; SSE42 cost table added with Nehalem numbers