This is an archive of the discontinued LLVM Phabricator instance.

Improved cost model for FDIV and FSQRT
ClosedPublic

Authored by avt77 on Oct 18 2016, 5:01 AM.

Details

Reviewers

spatel
RKSimon
ABataev
mkuper

Commits

rGd07c731d86d1: Improved cost model for FDIV and FSQRT, by Andrew Tischenko
rL285564: Improved cost model for FDIV and FSQRT, by Andrew Tischenko

Summary

A bug report describes the poor cost model for floating-point operations: Bug 29083 - [X86][SSE] Improve costs for floating point operations. This patch is the second in a series of patches dealing with the cost model.

Diff Detail

Event Timeline

avt77 updated this revision to Diff 74981. Oct 18 2016, 5:01 AM

avt77 retitled this revision to Improved cost model for FDIV and FSQRT.

avt77 updated this object.

avt77 added reviewers: RKSimon, spatel, ABataev.

I updated the cost numbers according to Simon's requirements.

Cost model numbers related to Pentium and Nehalem were updated.

RKSimon added inline comments. Oct 20 2016, 10:19 AM

lib/Target/X86/X86TargetTransformInfo.cpp
374	AVXCustomCostTable was recently added - you can probably merge these into that table and avoid the extra lookup.
519	Put these above the SDIV/UDIV costs so it doesn't look like they are under the same 'avoid vectorize division' comment, which is purely an integer division thing.
571	if (ST->hasSSE())
1191	Move SSE1FloatCostTable below SSE2CostTbl to keep to the 'descending subtarget' convention. You can probably rename it SSE1CostTbl as well.
1192	Worth adding a SSE41CostTbl for Core2 era costs?
1287	if (ST->hasSSE())

RKSimon added a reviewer: mkuper. Oct 20 2016, 10:20 AM

RKSimon added a subscriber: llvm-commits.

mkuper added inline comments. Oct 20 2016, 11:00 AM

lib/Target/X86/X86TargetTransformInfo.cpp
370	A YMM fdiv being 3 times as expensive as an XMM fdiv seems slightly odd. I'd expect "2, maybe a bit more", and Agner seems to agree - e.g. for Sandybridge he gives a range of 10-14 cycles for XMM DIVPS, and 20-28 cycles for YMM VDIVPS. Were your IACA checks accounting for additional instructions? Or is this an inconsistency between IACA and Agner's tests? (Note that these numbers are supposed to represent reciprocal throughput, but Agner's data for latency also has a factor of ~2)

avt77 added inline comments. Oct 25 2016, 1:12 AM

lib/Target/X86/X86TargetTransformInfo.cpp

370

I use the following numbers:
atischenko@ip-172-31-21-62:~/iaca-lin64/bin$ ./test-arith-fdiv.sh
*********SandyBridge*********
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - arith-fdiv.o
Binary Format - 64Bit
Architecture - SNB
Analysis Type - Throughput

Throughput Analysis Report

Block Throughput: 14.00 Cycles Throughput Bottleneck: Divider

Port Binding In Cycles Per Iteration:

| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |

| Cycles | 1.0 14.0 | 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 0.0 |

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred

* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued

@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

Num Of	Ports pressure in cycles
Uops	0 - DV	1	2 - D	3 - D	4	5

1.0 14.0

vdivps xmm0, xmm0, xmm0

Total Num Of Uops: 1

atischenko@ip-172-31-21-62:~/iaca-lin64/bin$ ./test-arith-fdiv.sh
*********SandyBridge*********
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - arith-fdiv.o
Binary Format - 64Bit
Architecture - SNB
Analysis Type - Throughput

Throughput Analysis Report

Block Throughput: 41.00 Cycles Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:

| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |

| Cycles | 2.0 28.0 | 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 1.0 |

* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued

@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

Num Of	Ports pressure in cycles
Uops	0 - DV	1	2 - D	3 - D	4	5

2.0 28.0

1.0

vdivps ymm0, ymm0, ymm0

Total Num Of Uops: 3

If we use the DV value, the ymm cost is about 2 times the xmm cost. But I used the "Block Throughput" value above. As you can see, my block is exactly one instruction, which is why I decided to use "Block Throughput". Maybe I'm wrong? Can anyone suggest the right choice?

I changed everything except the issue with ymm FDIV. As I wrote in my reply to the comment, I'm using Block Throughput from the IACA tool. If I'm wrong, please tell me and I'll re-collect all the numbers.

RKSimon added inline comments. Oct 25 2016, 4:28 AM

lib/Target/X86/X86TargetTransformInfo.cpp
1192	Please add Nehalem costs (from Agner) - they're notably better than the P4 default: FSQRT f32/4f32 : 18 f64/2f64 : 32

I'm swaying toward using Agner's numbers for SandyBridge FDIV / FSQRT costs - would anyone have any objections?

In D25722#578473, @RKSimon wrote:

I'm swaying toward using Agner's numbers for SandyBridge FDIV / FSQRT costs - would anyone have any objections?

I'd really like to understand the discrepancy between Agner's numbers and the IACA numbers - especially since IACA does give the expected number for the latency.
Andrew, any chance you could ping the IACA people at Intel and ask them about this?

Meanwhile, I think I'd also prefer this to go in with Agner's numbers - and change it to the IACA numbers later if it turns out it's accounting for something real.

In D25722#578664, @mkuper wrote:

I'd really like to understand the discrepancy between Agner's numbers and the IACA numbers - especially since IACA does give the expected number for the latency.
Andrew, any chance you could ping the IACA people at Intel and ask them about this?

IACA doesn't have that great support these days - the website says development has been suspended, although they have announced they will do a BDW/SKL bugfix release at some point. Similar comments about dodgy costs don't seem to be answered.

In D25722#578712, @RKSimon wrote:

In D25722#578664, @mkuper wrote:

I'd really like to understand the discrepancy between Agner's numbers and the IACA numbers - especially since IACA does give the expected number for the latency.
Andrew, any chance you could ping the IACA people at Intel and ask them about this?

IACA doesn't have that great support these days - the website says development has been suspended, although they have announced they will do a BDW/SKL bugfix release at some point. Similar comments about dodgy costs don't seem to be answered.

To the best of my knowledge, Andrew works for Intel, so he may have a better chance of getting an answer. :-)
As to support - the website has a comment from July that states that "[they] are resuming support for Intel(R) Architecture Code Analyzer with BDW and SKL support probably before end of 2016".

So, one can hope...

jrmuizel added a subscriber: jrmuizel. Oct 25 2016, 1:42 PM

If I understood correctly, I should replace all IACA numbers with Agner's numbers, right? OK, I'll do it.
JFYI, I have not been working at Intel since July, but of course I know a lot of people at Intel and I'll try to ask them about IACA's future.

avt77 added inline comments. Oct 27 2016, 6:35 AM

lib/Target/X86/X86TargetTransformInfo.cpp
370	Maybe I understand the difference between Agner's and IACA numbers (for SandyBridge at least). Agner's table has numbers for ymm only:
VDIVPS y,y,y 3 3 2 1 21-29 20-28 AVX
VDIVPS y,y,m256 4 3 2 1 1+ 20-28 AVX
but when I played with IACA it showed different types of operands: xmm0, ... and ymm0, ....
370	And another comment: for f32/v4f32 IACA shows 14 for vdivss/vdivps, but Agner's table shows the same 10-14 for divss/divps and 20-28 for vdivss/vdivps. Clang generates vdivss/vdivps (for the SNB target), which is why I should select 28 as the cost value if we're using Agner's numbers. Is it OK?
370	As an intermediate decision I did the following for SNB numbers: xmm operands use the first number as a cost while ymm operands use the second number. For example:
{ ISD::FDIV, MVT::f32,   20 }, // SNB from http://www.agner.org/ (IACA: 14)
{ ISD::FDIV, MVT::v4f32, 20 }, // SNB from http://www.agner.org/ (IACA: 14)
{ ISD::FDIV, MVT::v8f32, 28 }, // SNB from http://www.agner.org/ (IACA: 41)
Is it acceptable? In fact, our current cost tables need some extension here: it is not enough to know the instruction itself and the type of its operands. We need some target-dependent info as well. For example, if we're speaking about X86 targets only (exactly our case), we could add the size of the target registers or something similar, such as the ISA: SNB/HSW can use both divps and vdivps, and using two divps could be better than using one vdivps, but our current cost model cannot help decide which is better.
370	BTW, I've just realized that the Haswell IACA numbers are really close to the expected ones:
f32 13 vdivss
v4f32 13 vdivps xmm0...
v8f32 26 vdivps ymm0...
f64 20 vdivsd
v2f64 20 vdivpd
v4f64 47 vdivpd ymm0...
It means we have problems with the IACA numbers for SNB only (too big a difference between xmm and ymm). Maybe newer CPUs simply resolved this issue?
1192	JFYI, I got the same numbers for Nehalem with IACA

All numbers from IACA were replaced with Agner's numbers

The wrong SNB numbers were fixed (tnx to Simon Pilgrim)

I think this is almost done now - please can you add the AVX2/Haswell costs:

Haswell
FDIV f32/4f32 = 7, 8f32 = 14, f64/2f64 = 14, 4f64 = 28
FSQRT f32/4f32 = 7, 8f32 = 14, f64/2f64 = 14, 4f64 = 28

Other than that does anyone have any additional feedback?

In D25722#582086, @RKSimon wrote:

I think this is almost done now - please can you add the AVX2/Haswell costs:

Haswell
FDIV f32/4f32 = 7, 8f32 = 14, f64/2f64 = 14, 4f64 = 28
FSQRT f32/4f32 = 7, 8f32 = 14, f64/2f64 = 14, 4f64 = 28

Other than that does anyone have any additional feedback?

No, LGTM (modulo what Simon said about Haswell)

Thanks for fixing this, Andrew.

Haswell numbers added for AVX2

LGTM with one minor comment

lib/Target/X86/X86TargetTransformInfo.cpp
1231	Better to use the Pentium III costs F32 = 28, V4F32 = 56

This revision is now accepted and ready to land. Oct 29 2016, 4:36 AM

FSQRT changes: SSE1 Cost table updated with Pentium III numbers; SSE42 cost table added with Nehalem numbers

Closed by commit rL285564: Improved cost model for FDIV and FSQRT, by Andrew Tischenko (authored by ABataev). Oct 31 2016, 5:20 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Target/X86/X86TargetTransformInfo.cpp	76 lines
test/Analysis/CostModel/X86/arith-fp.ll	164 lines

Diff 76366

lib/Target/X86/X86TargetTransformInfo.cpp

// ... (318 unchanged lines hidden) ...

static const CostTblEntry AVX2CustomCostTable[] = {

{ ISD::SRL, MVT::v32i8, 11 }, // vpblendvb sequence.

{ ISD::SRL, MVT::v16i16, 10 }, // extend/vpsrlvd/pack sequence.

{ ISD::SRA, MVT::v32i8, 24 }, // vpblendvb sequence.

{ ISD::SRA, MVT::v16i16, 10 }, // extend/vpsravd/pack sequence.

{ ISD::SRA, MVT::v2i64, 4 }, // srl/xor/sub sequence.

{ ISD::SRA, MVT::v4i64, 4 }, // srl/xor/sub sequence.

{ ISD::FDIV, MVT::f32, 7 }, // Haswell from http://www.agner.org/

{ ISD::FDIV, MVT::v4f32, 7 }, // Haswell from http://www.agner.org/

{ ISD::FDIV, MVT::v8f32, 14 }, // Haswell from http://www.agner.org/

{ ISD::FDIV, MVT::f64, 14 }, // Haswell from http://www.agner.org/

{ ISD::FDIV, MVT::v2f64, 14 }, // Haswell from http://www.agner.org/

{ ISD::FDIV, MVT::v4f64, 28 }, // Haswell from http://www.agner.org/

};

// Look for AVX2 lowering tricks for custom cases.

if (ST->hasAVX2()) {

if (const auto *Entry = CostTableLookup(AVX2CustomCostTable, ISD,

LT.second))

return LT.first * Entry->Cost;

}

static const CostTblEntry AVXCustomCostTable[] = {

{ ISD::FDIV, MVT::f32, 14 }, // SNB from http://www.agner.org/

{ ISD::FDIV, MVT::v4f32, 14 }, // SNB from http://www.agner.org/

{ ISD::FDIV, MVT::v8f32, 28 }, // SNB from http://www.agner.org/

{ ISD::FDIV, MVT::f64, 22 }, // SNB from http://www.agner.org/

{ ISD::FDIV, MVT::v2f64, 22 }, // SNB from http://www.agner.org/

{ ISD::FDIV, MVT::v4f64, 44 }, // SNB from http://www.agner.org/

// Vectorizing division is a bad idea. See the SSE2 table for more comments.

{ ISD::SDIV, MVT::v32i8, 32*20 },

{ ISD::SDIV, MVT::v16i16, 16*20 },

{ ISD::SDIV, MVT::v8i32, 8*20 },

{ ISD::SDIV, MVT::v4i64, 4*20 },

{ ISD::UDIV, MVT::v32i8, 32*20 },

{ ISD::UDIV, MVT::v16i16, 16*20 },

{ ISD::UDIV, MVT::v8i32, 8*20 },

{ ISD::UDIV, MVT::v4i64, 4*20 },

};

// Look for AVX lowering tricks for custom cases.

if (ST->hasAVX()) {

if (const auto *Entry = CostTableLookup(AVXCustomCostTable, ISD,

LT.second))

return LT.first * Entry->Cost;

}

static const CostTblEntry SSE42FloatCostTable[] = {

{ ISD::FDIV, MVT::f32, 14 }, // Nehalem from http://www.agner.org/

{ ISD::FDIV, MVT::v4f32, 14 }, // Nehalem from http://www.agner.org/

{ ISD::FDIV, MVT::f64, 22 }, // Nehalem from http://www.agner.org/

{ ISD::FDIV, MVT::v2f64, 22 }, // Nehalem from http://www.agner.org/

};

if (ST->hasSSE42()) {

if (const auto *Entry = CostTableLookup(SSE42FloatCostTable, ISD,

LT.second))

return LT.first * Entry->Cost;

}

static const CostTblEntry SSE2UniformCostTable[] = {

// Uniform splats are cheaper for the following instructions.

{ ISD::SHL, MVT::v16i8, 1 }, // psllw.

{ ISD::SHL, MVT::v32i8, 2 }, // psllw.

{ ISD::SHL, MVT::v8i16, 1 }, // psllw.

{ ISD::SHL, MVT::v16i16, 2 }, // psllw.

{ ISD::SHL, MVT::v4i32, 1 }, // pslld

// ... (101 unchanged lines hidden) ...

static const CostTblEntry SSE2CostTable[] = {

{ ISD::SRA, MVT::v32i8, 2*54 }, // unpacked cmpgtb sequence.

{ ISD::SRA, MVT::v8i16, 32 }, // cmpgtb sequence.

{ ISD::SRA, MVT::v16i16, 2*32 }, // cmpgtb sequence.

{ ISD::SRA, MVT::v4i32, 16 }, // Shift each lane + blend.

{ ISD::SRA, MVT::v8i32, 2*16 }, // Shift each lane + blend.

{ ISD::SRA, MVT::v2i64, 12 }, // srl/xor/sub sequence.

{ ISD::SRA, MVT::v4i64, 2*12 }, // srl/xor/sub sequence.

{ ISD::FDIV, MVT::f32, 23 }, // Pentium IV from http://www.agner.org/

{ ISD::FDIV, MVT::v4f32, 39 }, // Pentium IV from http://www.agner.org/

{ ISD::FDIV, MVT::f64, 38 }, // Pentium IV from http://www.agner.org/

{ ISD::FDIV, MVT::v2f64, 69 }, // Pentium IV from http://www.agner.org/

// It is not a good idea to vectorize division. We have to scalarize it and

// in the process we will often end up having to spill regular

// registers. The overhead of division is going to dominate most kernels

// anyways so try hard to prevent vectorization of division - it is

// generally a bad idea. Assume somewhat arbitrarily that we have to be able

// to hide "20 cycles" for each lane.

{ ISD::SDIV, MVT::v16i8, 16*20 },

{ ISD::SDIV, MVT::v8i16, 8*20 },

{ ISD::SDIV, MVT::v4i32, 4*20 },

{ ISD::SDIV, MVT::v2i64, 2*20 },

{ ISD::UDIV, MVT::v16i8, 16*20 },

{ ISD::UDIV, MVT::v8i16, 8*20 },

{ ISD::UDIV, MVT::v4i32, 4*20 },

{ ISD::UDIV, MVT::v2i64, 2*20 },

};

if (ST->hasSSE2()) {

if (const auto *Entry = CostTableLookup(SSE2CostTable, ISD, LT.second))

return LT.first * Entry->Cost;

}

static const CostTblEntry AVX1CostTable[] = {

// We don't have to scalarize unsupported ops. We can issue two half-sized

// operations and we only need to extract the upper YMM half.

// Two ops + 1 extract + 1 insert = 4.

{ ISD::MUL, MVT::v16i16, 4 },

// ... (30 unchanged lines hidden) ...

if (const auto *Entry = CostTableLookup(CustomLowered, ISD, LT.second))

return LT.first * Entry->Cost;

// Special lowering of v4i32 mul on sse2, sse3: Lower v4i32 mul as 2x shuffle,

// 2x pmuludq, 2x shuffle.

if (ISD == ISD::MUL && LT.second == MVT::v4i32 && ST->hasSSE2() &&

!ST->hasSSE41())

return LT.first * 6;

static const CostTblEntry SSE1FloatCostTable[] = {

{ ISD::FDIV, MVT::f32, 17 }, // Pentium III from http://www.agner.org/

{ ISD::FDIV, MVT::v4f32, 34 }, // Pentium III from http://www.agner.org/

};

if (ST->hasSSE1())

if (const auto *Entry = CostTableLookup(SSE1FloatCostTable, ISD,

LT.second))

return LT.first * Entry->Cost;

// Fallback to the default implementation.

return BaseT::getArithmeticInstrCost(Opcode, Ty, Op1Info, Op2Info);

}

int X86TTIImpl::getShuffleCost(TTI::ShuffleKind Kind, Type *Tp, int Index,

Type *SubTp) {

// We only estimate the cost of reverse and alternate shuffles.

if (Kind != TTI::SK_Reverse && Kind != TTI::SK_Alternate)

// ... (562 unchanged lines hidden) ...

static const CostTblEntry AVX2CostTbl[] = {

{ ISD::CTLZ, MVT::v32i8, 9 },

{ ISD::CTPOP, MVT::v4i64, 7 },

{ ISD::CTPOP, MVT::v8i32, 11 },

{ ISD::CTPOP, MVT::v16i16, 9 },

{ ISD::CTPOP, MVT::v32i8, 6 },

{ ISD::CTTZ, MVT::v4i64, 10 },

{ ISD::CTTZ, MVT::v8i32, 14 },

{ ISD::CTTZ, MVT::v16i16, 12 },

{ ISD::CTTZ, MVT::v32i8, 9 },

{ ISD::FSQRT, MVT::f32, 7 }, // Haswell from http://www.agner.org/

{ ISD::FSQRT, MVT::v4f32, 7 }, // Haswell from http://www.agner.org/

{ ISD::FSQRT, MVT::v8f32, 14 }, // Haswell from http://www.agner.org/

{ ISD::FSQRT, MVT::f64, 14 }, // Haswell from http://www.agner.org/

{ ISD::FSQRT, MVT::v2f64, 14 }, // Haswell from http://www.agner.org/

{ ISD::FSQRT, MVT::v4f64, 28 }, // Haswell from http://www.agner.org/

};

static const CostTblEntry AVX1CostTbl[] = {

{ ISD::BITREVERSE, MVT::v4i64, 10 },

{ ISD::BITREVERSE, MVT::v8i32, 10 },

{ ISD::BITREVERSE, MVT::v16i16, 10 },

{ ISD::BITREVERSE, MVT::v32i8, 10 },

{ ISD::BSWAP, MVT::v4i64, 4 },

{ ISD::BSWAP, MVT::v8i32, 4 },

{ ISD::BSWAP, MVT::v16i16, 4 },

{ ISD::CTLZ, MVT::v4i64, 46 },

{ ISD::CTLZ, MVT::v8i32, 36 },

{ ISD::CTLZ, MVT::v16i16, 28 },

{ ISD::CTLZ, MVT::v32i8, 18 },

{ ISD::CTPOP, MVT::v4i64, 14 },

{ ISD::CTPOP, MVT::v8i32, 22 },

{ ISD::CTPOP, MVT::v16i16, 18 },

{ ISD::CTPOP, MVT::v32i8, 12 },

{ ISD::CTTZ, MVT::v4i64, 20 },

{ ISD::CTTZ, MVT::v8i32, 28 },

{ ISD::CTTZ, MVT::v16i16, 24 },

{ ISD::CTTZ, MVT::v32i8, 18 },

{ ISD::FSQRT, MVT::f32, 14 }, // SNB from http://www.agner.org/

{ ISD::FSQRT, MVT::v4f32, 14 }, // SNB from http://www.agner.org/

{ ISD::FSQRT, MVT::v8f32, 28 }, // SNB from http://www.agner.org/

{ ISD::FSQRT, MVT::f64, 21 }, // SNB from http://www.agner.org/

{ ISD::FSQRT, MVT::v2f64, 21 }, // SNB from http://www.agner.org/

{ ISD::FSQRT, MVT::v4f64, 43 }, // SNB from http://www.agner.org/

};

static const CostTblEntry SSE42CostTbl[] = {

{ ISD::FSQRT, MVT::f32, 18 }, // Nehalem from http://www.agner.org/

{ ISD::FSQRT, MVT::v4f32, 18 }, // Nehalem from http://www.agner.org/

};

static const CostTblEntry SSSE3CostTbl[] = {

{ ISD::BITREVERSE, MVT::v2i64, 5 },

{ ISD::BITREVERSE, MVT::v4i32, 5 },

{ ISD::BITREVERSE, MVT::v8i16, 5 },

{ ISD::BITREVERSE, MVT::v16i8, 5 },

{ ISD::BSWAP, MVT::v2i64, 1 },

{ ISD::BSWAP, MVT::v4i32, 1 },

{ ISD::BSWAP, MVT::v8i16, 1 },

{ ISD::CTLZ, MVT::v2i64, 23 },

// ... (16 unchanged lines hidden) ...

static const CostTblEntry SSE2CostTbl[] = {

/* ISD::CTLZ - currently scalarized pre-SSSE3 */

{ ISD::CTPOP, MVT::v2i64, 12 },

{ ISD::CTPOP, MVT::v4i32, 15 },

{ ISD::CTPOP, MVT::v8i16, 13 },

{ ISD::CTPOP, MVT::v16i8, 10 },

{ ISD::CTTZ, MVT::v2i64, 14 },

{ ISD::CTTZ, MVT::v4i32, 18 },

{ ISD::CTTZ, MVT::v8i16, 16 },

{ ISD::CTTZ, MVT::v16i8, 13 },

{ ISD::FSQRT, MVT::f64, 32 }, // Nehalem from http://www.agner.org/

{ ISD::FSQRT, MVT::v2f64, 32 }, // Nehalem from http://www.agner.org/

};

static const CostTblEntry SSE1CostTbl[] = {

{ ISD::FSQRT, MVT::f32, 28 }, // Pentium III from http://www.agner.org/

{ ISD::FSQRT, MVT::v4f32, 56 }, // Pentium III from http://www.agner.org/

};

unsigned ISD = ISD::DELETED_NODE;

switch (IID) {

default:

break;

case Intrinsic::bitreverse:

ISD = ISD::BITREVERSE;

break;

case Intrinsic::bswap:

ISD = ISD::BSWAP;

break;

case Intrinsic::ctlz:

ISD = ISD::CTLZ;

break;

case Intrinsic::ctpop:

ISD = ISD::CTPOP;

break;

case Intrinsic::cttz:

ISD = ISD::CTTZ;

break;

case Intrinsic::sqrt:

ISD = ISD::FSQRT;

break;

}

// Legalize the type.

std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, RetTy);

MVT MTy = LT.second;

// Attempt to lookup cost.

if (ST->hasXOP())

if (const auto *Entry = CostTableLookup(XOPCostTbl, ISD, MTy))

return LT.first * Entry->Cost;

if (ST->hasAVX2())

if (const auto *Entry = CostTableLookup(AVX2CostTbl, ISD, MTy))

return LT.first * Entry->Cost;

if (ST->hasAVX())

if (const auto *Entry = CostTableLookup(AVX1CostTbl, ISD, MTy))

return LT.first * Entry->Cost;

if (ST->hasSSE42())

if (const auto *Entry = CostTableLookup(SSE42CostTbl, ISD, MTy))

return LT.first * Entry->Cost;

if (ST->hasSSSE3())

if (const auto *Entry = CostTableLookup(SSSE3CostTbl, ISD, MTy))

return LT.first * Entry->Cost;

if (ST->hasSSE2())

if (const auto *Entry = CostTableLookup(SSE2CostTbl, ISD, MTy))

return LT.first * Entry->Cost;

if (ST->hasSSE1())

if (const auto *Entry = CostTableLookup(SSE1CostTbl, ISD, MTy))

return LT.first * Entry->Cost;

return BaseT::getIntrinsicInstrCost(IID, RetTy, Tys, FMF);

}

int X86TTIImpl::getIntrinsicInstrCost(Intrinsic::ID IID, Type *RetTy,

ArrayRef<Value *> Args, FastMathFlags FMF) {

return BaseT::getIntrinsicInstrCost(IID, RetTy, Args, FMF);

}

// ... (590 unchanged lines hidden) ...

test/Analysis/CostModel/X86/arith-fp.ll

; RUN: opt < %s -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+sse2 | FileCheck %s --check-prefix=CHECK --check-prefix=SSE2		; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+sse2 | FileCheck %s --check-prefix=CHECK --check-prefix=SSE2
; RUN: opt < %s -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+sse4.2 | FileCheck %s --check-prefix=CHECK --check-prefix=SSE42		; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+sse4.2 | FileCheck %s --check-prefix=CHECK --check-prefix=SSE42
; RUN: opt < %s -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx,+fma | FileCheck %s --check-prefix=CHECK --check-prefix=AVX		; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx,+fma | FileCheck %s --check-prefix=CHECK --check-prefix=AVX
; RUN: opt < %s -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx2,+fma | FileCheck %s --check-prefix=CHECK --check-prefix=AVX2		; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx2,+fma | FileCheck %s --check-prefix=CHECK --check-prefix=AVX2
; RUN: opt < %s -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx512f | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512F		; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx512f | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512F
; RUN: opt < %s -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx512f,+avx512bw | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512BW		; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze -mtriple=x86_64-apple-macosx10.8.0 -mattr=+avx512f,+avx512bw | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512BW

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"		target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.8.0"		target triple = "x86_64-apple-macosx10.8.0"

; CHECK-LABEL: 'fadd'		; CHECK-LABEL: 'fadd'
define i32 @fadd(i32 %arg) {		define i32 @fadd(i32 %arg) {
; SSE2: cost of 2 {{.*}} %F32 = fadd		; SSE2: cost of 2 {{.*}} %F32 = fadd
; SSE42: cost of 2 {{.*}} %F32 = fadd		; SSE42: cost of 2 {{.*}} %F32 = fadd
; ... (155 unchanged lines hidden, resuming inside @fmul) ...
; AVX512: cost of 2 {{.*}} %V8F64 = fmul		; AVX512: cost of 2 {{.*}} %V8F64 = fmul
%V8F64 = fmul <8 x double> undef, undef		%V8F64 = fmul <8 x double> undef, undef

ret i32 undef		ret i32 undef
}		}

; CHECK-LABEL: 'fdiv'		; CHECK-LABEL: 'fdiv'
define i32 @fdiv(i32 %arg) {		define i32 @fdiv(i32 %arg) {
; SSE2: cost of 2 {{.*}} %F32 = fdiv		; SSE2: cost of 23 {{.*}} %F32 = fdiv
; SSE42: cost of 2 {{.*}} %F32 = fdiv		; SSE42: cost of 14 {{.*}} %F32 = fdiv
; AVX: cost of 2 {{.*}} %F32 = fdiv		; AVX: cost of 14 {{.*}} %F32 = fdiv
; AVX2: cost of 2 {{.*}} %F32 = fdiv		; AVX2: cost of 7 {{.*}} %F32 = fdiv
; AVX512: cost of 2 {{.*}} %F32 = fdiv		; AVX512: cost of 7 {{.*}} %F32 = fdiv
%F32 = fdiv float undef, undef		%F32 = fdiv float undef, undef
; SSE2: cost of 2 {{.*}} %V4F32 = fdiv		; SSE2: cost of 39 {{.*}} %V4F32 = fdiv
; SSE42: cost of 2 {{.*}} %V4F32 = fdiv		; SSE42: cost of 14 {{.*}} %V4F32 = fdiv
; AVX: cost of 2 {{.*}} %V4F32 = fdiv		; AVX: cost of 14 {{.*}} %V4F32 = fdiv
; AVX2: cost of 2 {{.*}} %V4F32 = fdiv		; AVX2: cost of 7 {{.*}} %V4F32 = fdiv
; AVX512: cost of 2 {{.*}} %V4F32 = fdiv		; AVX512: cost of 7 {{.*}} %V4F32 = fdiv
%V4F32 = fdiv <4 x float> undef, undef		%V4F32 = fdiv <4 x float> undef, undef
; SSE2: cost of 4 {{.*}} %V8F32 = fdiv		; SSE2: cost of 78 {{.*}} %V8F32 = fdiv
; SSE42: cost of 4 {{.*}} %V8F32 = fdiv		; SSE42: cost of 28 {{.*}} %V8F32 = fdiv
; AVX: cost of 2 {{.*}} %V8F32 = fdiv		; AVX: cost of 28 {{.*}} %V8F32 = fdiv
; AVX2: cost of 2 {{.*}} %V8F32 = fdiv		; AVX2: cost of 14 {{.*}} %V8F32 = fdiv
; AVX512: cost of 2 {{.*}} %V8F32 = fdiv		; AVX512: cost of 14 {{.*}} %V8F32 = fdiv
%V8F32 = fdiv <8 x float> undef, undef		%V8F32 = fdiv <8 x float> undef, undef
; SSE2: cost of 8 {{.*}} %V16F32 = fdiv		; SSE2: cost of 156 {{.*}} %V16F32 = fdiv
; SSE42: cost of 8 {{.*}} %V16F32 = fdiv		; SSE42: cost of 56 {{.*}} %V16F32 = fdiv
; AVX: cost of 4 {{.*}} %V16F32 = fdiv		; AVX: cost of 56 {{.*}} %V16F32 = fdiv
; AVX2: cost of 4 {{.*}} %V16F32 = fdiv		; AVX2: cost of 28 {{.*}} %V16F32 = fdiv
; AVX512: cost of 2 {{.*}} %V16F32 = fdiv		; AVX512: cost of 2 {{.*}} %V16F32 = fdiv
%V16F32 = fdiv <16 x float> undef, undef		%V16F32 = fdiv <16 x float> undef, undef

; SSE2: cost of 2 {{.*}} %F64 = fdiv		; SSE2: cost of 38 {{.*}} %F64 = fdiv
; SSE42: cost of 2 {{.*}} %F64 = fdiv		; SSE42: cost of 22 {{.*}} %F64 = fdiv
; AVX: cost of 2 {{.*}} %F64 = fdiv		; AVX: cost of 22 {{.*}} %F64 = fdiv
; AVX2: cost of 2 {{.*}} %F64 = fdiv		; AVX2: cost of 14 {{.*}} %F64 = fdiv
; AVX512: cost of 2 {{.*}} %F64 = fdiv		; AVX512: cost of 14 {{.*}} %F64 = fdiv
%F64 = fdiv double undef, undef		%F64 = fdiv double undef, undef
; SSE2: cost of 2 {{.*}} %V2F64 = fdiv		; SSE2: cost of 69 {{.*}} %V2F64 = fdiv
; SSE42: cost of 2 {{.*}} %V2F64 = fdiv		; SSE42: cost of 22 {{.*}} %V2F64 = fdiv
; AVX: cost of 2 {{.*}} %V2F64 = fdiv		; AVX: cost of 22 {{.*}} %V2F64 = fdiv
; AVX2: cost of 2 {{.*}} %V2F64 = fdiv		; AVX2: cost of 14 {{.*}} %V2F64 = fdiv
; AVX512: cost of 2 {{.*}} %V2F64 = fdiv		; AVX512: cost of 14 {{.*}} %V2F64 = fdiv
%V2F64 = fdiv <2 x double> undef, undef		%V2F64 = fdiv <2 x double> undef, undef
; SSE2: cost of 4 {{.*}} %V4F64 = fdiv		; SSE2: cost of 138 {{.*}} %V4F64 = fdiv
; SSE42: cost of 4 {{.*}} %V4F64 = fdiv		; SSE42: cost of 44 {{.*}} %V4F64 = fdiv
; AVX: cost of 2 {{.*}} %V4F64 = fdiv		; AVX: cost of 44 {{.*}} %V4F64 = fdiv
; AVX2: cost of 2 {{.*}} %V4F64 = fdiv		; AVX2: cost of 28 {{.*}} %V4F64 = fdiv
; AVX512: cost of 2 {{.*}} %V4F64 = fdiv		; AVX512: cost of 28 {{.*}} %V4F64 = fdiv
%V4F64 = fdiv <4 x double> undef, undef		%V4F64 = fdiv <4 x double> undef, undef
; SSE2: cost of 8 {{.*}} %V8F64 = fdiv		; SSE2: cost of 276 {{.*}} %V8F64 = fdiv
; SSE42: cost of 8 {{.*}} %V8F64 = fdiv		; SSE42: cost of 88 {{.*}} %V8F64 = fdiv
; AVX: cost of 4 {{.*}} %V8F64 = fdiv		; AVX: cost of 88 {{.*}} %V8F64 = fdiv
; AVX2: cost of 4 {{.*}} %V8F64 = fdiv		; AVX2: cost of 56 {{.*}} %V8F64 = fdiv
; AVX512: cost of 2 {{.*}} %V8F64 = fdiv		; AVX512: cost of 2 {{.*}} %V8F64 = fdiv
%V8F64 = fdiv <8 x double> undef, undef		%V8F64 = fdiv <8 x double> undef, undef

ret i32 undef		ret i32 undef
}		}

; CHECK-LABEL: 'frem'		; CHECK-LABEL: 'frem'
define i32 @frem(i32 %arg) {		define i32 @frem(i32 %arg) {
[47 lines not shown]
; AVX512: cost of 30 {{.*}} %V8F64 = frem		; AVX512: cost of 30 {{.*}} %V8F64 = frem
%V8F64 = frem <8 x double> undef, undef		%V8F64 = frem <8 x double> undef, undef

ret i32 undef		ret i32 undef
}		}

; CHECK-LABEL: 'fsqrt'		; CHECK-LABEL: 'fsqrt'
define i32 @fsqrt(i32 %arg) {		define i32 @fsqrt(i32 %arg) {
; SSE2: cost of 1 {{.*}} %F32 = call float @llvm.sqrt.f32		; SSE2: cost of 28 {{.*}} %F32 = call float @llvm.sqrt.f32
; SSE42: cost of 1 {{.*}} %F32 = call float @llvm.sqrt.f32		; SSE42: cost of 18 {{.*}} %F32 = call float @llvm.sqrt.f32
; AVX: cost of 1 {{.*}} %F32 = call float @llvm.sqrt.f32		; AVX: cost of 14 {{.*}} %F32 = call float @llvm.sqrt.f32
; AVX2: cost of 1 {{.*}} %F32 = call float @llvm.sqrt.f32		; AVX2: cost of 7 {{.*}} %F32 = call float @llvm.sqrt.f32
; AVX512: cost of 1 {{.*}} %F32 = call float @llvm.sqrt.f32		; AVX512: cost of 7 {{.*}} %F32 = call float @llvm.sqrt.f32
%F32 = call float @llvm.sqrt.f32(float undef)		%F32 = call float @llvm.sqrt.f32(float undef)
; SSE2: cost of 1 {{.*}} %V4F32 = call <4 x float> @llvm.sqrt.v4f32		; SSE2: cost of 56 {{.*}} %V4F32 = call <4 x float> @llvm.sqrt.v4f32
; SSE42: cost of 1 {{.*}} %V4F32 = call <4 x float> @llvm.sqrt.v4f32		; SSE42: cost of 18 {{.*}} %V4F32 = call <4 x float> @llvm.sqrt.v4f32
; AVX: cost of 1 {{.*}} %V4F32 = call <4 x float> @llvm.sqrt.v4f32		; AVX: cost of 14 {{.*}} %V4F32 = call <4 x float> @llvm.sqrt.v4f32
; AVX2: cost of 1 {{.*}} %V4F32 = call <4 x float> @llvm.sqrt.v4f32		; AVX2: cost of 7 {{.*}} %V4F32 = call <4 x float> @llvm.sqrt.v4f32
; AVX512: cost of 1 {{.*}} %V4F32 = call <4 x float> @llvm.sqrt.v4f32		; AVX512: cost of 7 {{.*}} %V4F32 = call <4 x float> @llvm.sqrt.v4f32
%V4F32 = call <4 x float> @llvm.sqrt.v4f32(<4 x float> undef)		%V4F32 = call <4 x float> @llvm.sqrt.v4f32(<4 x float> undef)
; SSE2: cost of 4 {{.*}} %V8F32 = call <8 x float> @llvm.sqrt.v8f32		; SSE2: cost of 112 {{.*}} %V8F32 = call <8 x float> @llvm.sqrt.v8f32
; SSE42: cost of 4 {{.*}} %V8F32 = call <8 x float> @llvm.sqrt.v8f32		; SSE42: cost of 36 {{.*}} %V8F32 = call <8 x float> @llvm.sqrt.v8f32
; AVX: cost of 1 {{.*}} %V8F32 = call <8 x float> @llvm.sqrt.v8f32		; AVX: cost of 28 {{.*}} %V8F32 = call <8 x float> @llvm.sqrt.v8f32
; AVX2: cost of 1 {{.*}} %V8F32 = call <8 x float> @llvm.sqrt.v8f32		; AVX2: cost of 14 {{.*}} %V8F32 = call <8 x float> @llvm.sqrt.v8f32
; AVX512: cost of 1 {{.*}} %V8F32 = call <8 x float> @llvm.sqrt.v8f32		; AVX512: cost of 14 {{.*}} %V8F32 = call <8 x float> @llvm.sqrt.v8f32
%V8F32 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> undef)		%V8F32 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> undef)
; SSE2: cost of 8 {{.*}} %V16F32 = call <16 x float> @llvm.sqrt.v16f32		; SSE2: cost of 224 {{.*}} %V16F32 = call <16 x float> @llvm.sqrt.v16f32
; SSE42: cost of 8 {{.*}} %V16F32 = call <16 x float> @llvm.sqrt.v16f32		; SSE42: cost of 72 {{.*}} %V16F32 = call <16 x float> @llvm.sqrt.v16f32
; AVX: cost of 4 {{.*}} %V16F32 = call <16 x float> @llvm.sqrt.v16f32		; AVX: cost of 56 {{.*}} %V16F32 = call <16 x float> @llvm.sqrt.v16f32
; AVX2: cost of 4 {{.*}} %V16F32 = call <16 x float> @llvm.sqrt.v16f32		; AVX2: cost of 28 {{.*}} %V16F32 = call <16 x float> @llvm.sqrt.v16f32
; AVX512: cost of 1 {{.*}} %V16F32 = call <16 x float> @llvm.sqrt.v16f32		; AVX512: cost of 1 {{.*}} %V16F32 = call <16 x float> @llvm.sqrt.v16f32
%V16F32 = call <16 x float> @llvm.sqrt.v16f32(<16 x float> undef)		%V16F32 = call <16 x float> @llvm.sqrt.v16f32(<16 x float> undef)

; SSE2: cost of 1 {{.*}} %F64 = call double @llvm.sqrt.f64		; SSE2: cost of 32 {{.*}} %F64 = call double @llvm.sqrt.f64
; SSE42: cost of 1 {{.*}} %F64 = call double @llvm.sqrt.f64		; SSE42: cost of 32 {{.*}} %F64 = call double @llvm.sqrt.f64
; AVX: cost of 1 {{.*}} %F64 = call double @llvm.sqrt.f64		; AVX: cost of 21 {{.*}} %F64 = call double @llvm.sqrt.f64
; AVX2: cost of 1 {{.*}} %F64 = call double @llvm.sqrt.f64		; AVX2: cost of 14 {{.*}} %F64 = call double @llvm.sqrt.f64
; AVX512: cost of 1 {{.*}} %F64 = call double @llvm.sqrt.f64		; AVX512: cost of 14 {{.*}} %F64 = call double @llvm.sqrt.f64
%F64 = call double @llvm.sqrt.f64(double undef)		%F64 = call double @llvm.sqrt.f64(double undef)
; SSE2: cost of 1 {{.*}} %V2F64 = call <2 x double> @llvm.sqrt.v2f64		; SSE2: cost of 32 {{.*}} %V2F64 = call <2 x double> @llvm.sqrt.v2f64
; SSE42: cost of 1 {{.*}} %V2F64 = call <2 x double> @llvm.sqrt.v2f64		; SSE42: cost of 32 {{.*}} %V2F64 = call <2 x double> @llvm.sqrt.v2f64
; AVX: cost of 1 {{.*}} %V2F64 = call <2 x double> @llvm.sqrt.v2f64		; AVX: cost of 21 {{.*}} %V2F64 = call <2 x double> @llvm.sqrt.v2f64
; AVX2: cost of 1 {{.*}} %V2F64 = call <2 x double> @llvm.sqrt.v2f64		; AVX2: cost of 14 {{.*}} %V2F64 = call <2 x double> @llvm.sqrt.v2f64
; AVX512: cost of 1 {{.*}} %V2F64 = call <2 x double> @llvm.sqrt.v2f64		; AVX512: cost of 14 {{.*}} %V2F64 = call <2 x double> @llvm.sqrt.v2f64
%V2F64 = call <2 x double> @llvm.sqrt.v2f64(<2 x double> undef)		%V2F64 = call <2 x double> @llvm.sqrt.v2f64(<2 x double> undef)
; SSE2: cost of 4 {{.*}} %V4F64 = call <4 x double> @llvm.sqrt.v4f64		; SSE2: cost of 64 {{.*}} %V4F64 = call <4 x double> @llvm.sqrt.v4f64
; SSE42: cost of 4 {{.*}} %V4F64 = call <4 x double> @llvm.sqrt.v4f64		; SSE42: cost of 64 {{.*}} %V4F64 = call <4 x double> @llvm.sqrt.v4f64
; AVX: cost of 1 {{.*}} %V4F64 = call <4 x double> @llvm.sqrt.v4f64		; AVX: cost of 43 {{.*}} %V4F64 = call <4 x double> @llvm.sqrt.v4f64
; AVX2: cost of 1 {{.*}} %V4F64 = call <4 x double> @llvm.sqrt.v4f64		; AVX2: cost of 28 {{.*}} %V4F64 = call <4 x double> @llvm.sqrt.v4f64
; AVX512: cost of 1 {{.*}} %V4F64 = call <4 x double> @llvm.sqrt.v4f64		; AVX512: cost of 28 {{.*}} %V4F64 = call <4 x double> @llvm.sqrt.v4f64
%V4F64 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> undef)		%V4F64 = call <4 x double> @llvm.sqrt.v4f64(<4 x double> undef)
; SSE2: cost of 8 {{.*}} %V8F64 = call <8 x double> @llvm.sqrt.v8f64		; SSE2: cost of 128 {{.*}} %V8F64 = call <8 x double> @llvm.sqrt.v8f64
; SSE42: cost of 8 {{.*}} %V8F64 = call <8 x double> @llvm.sqrt.v8f64		; SSE42: cost of 128 {{.*}} %V8F64 = call <8 x double> @llvm.sqrt.v8f64
; AVX: cost of 4 {{.*}} %V8F64 = call <8 x double> @llvm.sqrt.v8f64		; AVX: cost of 86 {{.*}} %V8F64 = call <8 x double> @llvm.sqrt.v8f64
; AVX2: cost of 4 {{.*}} %V8F64 = call <8 x double> @llvm.sqrt.v8f64		; AVX2: cost of 56 {{.*}} %V8F64 = call <8 x double> @llvm.sqrt.v8f64
; AVX512: cost of 1 {{.*}} %V8F64 = call <8 x double> @llvm.sqrt.v8f64		; AVX512: cost of 1 {{.*}} %V8F64 = call <8 x double> @llvm.sqrt.v8f64
%V8F64 = call <8 x double> @llvm.sqrt.v8f64(<8 x double> undef)		%V8F64 = call <8 x double> @llvm.sqrt.v8f64(<8 x double> undef)

ret i32 undef		ret i32 undef
}		}

; CHECK-LABEL: 'fabs'		; CHECK-LABEL: 'fabs'
define i32 @fabs(i32 %arg) {		define i32 @fabs(i32 %arg) {
[202 lines not shown]
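The new expectations above follow a regular pattern: once a vector is wider than the widest legal register, its cost is the per-register cost multiplied by the number of registers it is split into. For example, SSE42 `fdiv` goes 14, 28, 56 for `<4 x float>`, `<8 x float>`, `<16 x float>`. The following is a minimal sketch of that arithmetic only. The helper names `numSplits` and `scaledCost` are hypothetical and are not LLVM APIs; the real implementation derives the split count from type legalization rather than from raw bit widths.

```cpp
// Hypothetical helper (not an LLVM API): number of legal registers needed
// to hold a vector of `vectorBits` bits when the widest register has `regBits` bits.
constexpr unsigned numSplits(unsigned vectorBits, unsigned regBits) {
  return vectorBits <= regBits ? 1u : (vectorBits + regBits - 1) / regBits;
}

// Hypothetical helper: cost of the whole vector operation, assuming each
// legal-width piece costs `perRegCost`.
constexpr unsigned scaledCost(unsigned vectorBits, unsigned regBits,
                              unsigned perRegCost) {
  return numSplits(vectorBits, regBits) * perRegCost;
}

// Checked against the new expectations in test/Analysis/CostModel/X86/arith-fp.ll.
static_assert(scaledCost(256, 128, 14) == 28, "SSE42 fdiv <8 x float>");
static_assert(scaledCost(512, 128, 14) == 56, "SSE42 fdiv <16 x float>");
static_assert(scaledCost(512, 256, 28) == 56, "AVX fdiv <16 x float>");
static_assert(scaledCost(512, 128, 69) == 276, "SSE2 fdiv <8 x double>");
static_assert(scaledCost(512, 128, 18) == 72, "SSE42 sqrt <16 x float>");
```

The AVX512 rows for 512-bit vectors (cost 2 for `fdiv`, 1 for `sqrt`) are unchanged by this diff and do not follow the pattern, so they are left out of the checks.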


Throughput Analysis Report

Port Binding In Cycles Per Iteration:

| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |

| Cycles | 1.0 14.0 | 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 0.0 |

- ESP Tracking sync uop was issued

Throughput Analysis Report

Port Binding In Cycles Per Iteration:

| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |

| Cycles | 2.0 28.0 | 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 1.0 |

- ESP Tracking sync uop was issued
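The two reports show 14.0 and 28.0 cycles on the port 0 divider, which matches the AVX expectations for `fdiv` on `<4 x float>` (14) and `<8 x float>` (28) in arith-fp.ll. Assuming the reports correspond to the 128-bit and 256-bit forms, a per-subtarget cost table lookup could be sketched as below. The `Op`, `Ty`, `CostEntry`, `AVXSketchTable`, and `lookupCost` names are hypothetical illustrations, not the types used in X86TargetTransformInfo.cpp, and the table holds only values taken from the test expectations.

```cpp
// Hypothetical operation and type tags, for illustration only.
enum class Op { FDiv, FSqrt };
enum class Ty { v4f32, v8f32 };

// Hypothetical table entry: (operation, legal type) -> cost.
struct CostEntry {
  Op op;
  Ty ty;
  unsigned cost;
};

// AVX costs copied from the new expectations in arith-fp.ll.
constexpr CostEntry AVXSketchTable[] = {
    {Op::FDiv, Ty::v4f32, 14},
    {Op::FDiv, Ty::v8f32, 28},
    {Op::FSqrt, Ty::v4f32, 14},
    {Op::FSqrt, Ty::v8f32, 28},
};

// Linear search of the table; returns 0 when there is no matching entry,
// in which case a caller would fall back to a more generic table.
inline unsigned lookupCost(Op op, Ty ty) {
  for (const CostEntry &e : AVXSketchTable)
    if (e.op == op && e.ty == ty)
      return e.cost;
  return 0;
}
```

With these entries the 256-bit cost is exactly twice the 128-bit cost, which is the factor of about 2 that mkuper expected from Agner's figures.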

Revision Contents

Diff 76366

lib/Target/X86/X86TargetTransformInfo.cpp

test/Analysis/CostModel/X86/arith-fp.ll
