This is an archive of the discontinued LLVM Phabricator instance.

Improved cost model for FDIV and FSQRT
ClosedPublic

Authored by avt77 on Oct 18 2016, 5:01 AM.

Details

Summary

There is a bug describing the poor cost model for floating-point operations: Bug 29083 - [X86][SSE] Improve costs for floating point operations. This patch is the second in a series of patches dealing with the cost model.
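For reference, the kind of change involved is adding entries to the per-subtarget cost tables in lib/Target/X86/X86TargetTransformInfo.cpp. A minimal sketch of the mechanism, using LLVM's CostTblEntry/CostTableLookup helpers inside X86TTIImpl::getArithmeticInstrCost(); the table name and cost values below are illustrative placeholders, not the numbers from this patch:

static const CostTblEntry ExampleSSE2CostTbl[] = {
  { ISD::FDIV,  MVT::f32,   23 }, // illustrative reciprocal-throughput cost
  { ISD::FDIV,  MVT::v4f32, 39 }, // illustrative
  { ISD::FSQRT, MVT::v2f64, 32 }, // illustrative
};

// Inside X86TTIImpl::getArithmeticInstrCost(), after type legalization (LT):
if (ST->hasSSE2())
  if (const auto *Entry = CostTableLookup(ExampleSSE2CostTbl, ISD, LT.second))
    return LT.first * Entry->Cost;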

Diff Detail

Event Timeline

avt77 updated this revision to Diff 74981.Oct 18 2016, 5:01 AM
avt77 retitled this revision from to Improved cost model for FDIV and FSQRT.
avt77 updated this object.
avt77 added reviewers: RKSimon, spatel, ABataev.
avt77 updated this revision to Diff 75129.Oct 19 2016, 4:49 AM

I updated the cost numbers according to Simon's requirements.

avt77 updated this revision to Diff 75284.Edited Oct 20 2016, 4:35 AM

Cost model numbers related to Pentium and Nehalem were updated.

RKSimon added inline comments.Oct 20 2016, 10:19 AM
lib/Target/X86/X86TargetTransformInfo.cpp
273

AVXCustomCostTable was recently added - you can probably merge these into that table and avoid the extra lookup.
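For illustration, the suggestion amounts to something like the following sketch (the existing entries and the costs shown are placeholders): fold the new FDIV entries into the existing AVX custom table so a single CostTableLookup covers both and the extra lookup goes away.

static const CostTblEntry AVXCustomCostTable[] = {
  // ... existing custom AVX entries ...
  { ISD::FDIV, MVT::f32,   14 }, // new FDIV entries folded in (illustrative costs)
  { ISD::FDIV, MVT::v8f32, 28 }, // illustrative
};

// One lookup now covers both the old and the new entries:
if (ST->hasAVX())
  if (const auto *Entry = CostTableLookup(AVXCustomCostTable, ISD, LT.second))
    return LT.first * Entry->Cost;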

419

Put these above the SDIV/UDIV costs so it doesn't look like they are under the same 'avoid vectorize division' comment, which is purely an integer division thing.

474

if (ST->hasSSE1())

1087

Move SSE1FloatCostTable below SSE2CostTbl to keep to the 'descending subtarget' convention. You can probably rename it SSE1CostTbl as well.
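For context, the 'descending subtarget' convention means querying the newest feature set first and falling through to older ones, so a newer machine never picks up the older, more pessimistic numbers. A rough sketch (table names are placeholders):

if (ST->hasAVX())
  if (const auto *Entry = CostTableLookup(AVXCostTbl, ISD, LT.second))
    return LT.first * Entry->Cost;
if (ST->hasSSE42())
  if (const auto *Entry = CostTableLookup(SSE42CostTbl, ISD, LT.second))
    return LT.first * Entry->Cost;
if (ST->hasSSE2())
  if (const auto *Entry = CostTableLookup(SSE2CostTbl, ISD, LT.second))
    return LT.first * Entry->Cost;
if (ST->hasSSE1())
  if (const auto *Entry = CostTableLookup(SSE1CostTbl, ISD, LT.second))
    return LT.first * Entry->Cost;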

1088

Worth adding an SSE41CostTbl for Core2-era costs?

1175

if (ST->hasSSE1())

RKSimon added a subscriber: llvm-commits.
mkuper added inline comments.Oct 20 2016, 11:00 AM
lib/Target/X86/X86TargetTransformInfo.cpp
269

A YMM fdiv being 3 times as expensive as an XMM fdiv seems slightly odd.

I'd expect "2, maybe a bit more", and Agner seems to agree - e.g. for Sandy Bridge he gives a range of 10-14 cycles for XMM DIVPS, and 20-28 cycles for YMM VDIVPS. Were your IACA checks accounting for additional instructions? Or is this an inconsistency between IACA and Agner's tests?

(Note that these numbers are supposed to represent reciprocal throughput, but Agner's data for latency also has a factor of ~2.)

avt77 added inline comments.Oct 25 2016, 1:12 AM
lib/Target/X86/X86TargetTransformInfo.cpp
269

I use the following numbers:
atischenko@ip-172-31-21-62:~/iaca-lin64/bin$ ./test-arith-fdiv.sh
*********SandyBridge*********
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - arith-fdiv.o
Binary Format - 64Bit
Architecture - SNB
Analysis Type - Throughput

Throughput Analysis Report

Block Throughput: 14.00 Cycles Throughput Bottleneck: Divider

Port Binding In Cycles Per Iteration:

| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |

| Cycles | 1.0 14.0 | 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 0.0 |

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred

* - instruction micro-ops not bound to a port

^ - Micro Fusion happened

# - ESP Tracking sync uop was issued

@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |           Ports pressure in cycles            |    |
|  Uops  | 0 - DV    | 1   | 2 - D   | 3 - D   | 4   | 5  |    |
|   1    | 1.0  14.0 |     |         |         |     |    | CP | vdivps xmm0, xmm0, xmm0

Total Num Of Uops: 1

atischenko@ip-172-31-21-62:~/iaca-lin64/bin$ ./test-arith-fdiv.sh
*********SandyBridge*********
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - arith-fdiv.o
Binary Format - 64Bit
Architecture - SNB
Analysis Type - Throughput

Throughput Analysis Report

Block Throughput: 41.00 Cycles Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:

| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |

| Cycles | 2.0 28.0 | 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 1.0 |

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred

* - instruction micro-ops not bound to a port

^ - Micro Fusion happened

# - ESP Tracking sync uop was issued

@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |           Ports pressure in cycles            |     |
|  Uops  | 0 - DV    | 1   | 2 - D   | 3 - D   | 4   | 5   |    |
|   3    | 2.0  28.0 |     |         |         |     | 1.0 | CP | vdivps ymm0, ymm0, ymm0

Total Num Of Uops: 3

If we use the DV value then the ratio is about 2x, but I used the "Block Throughput" value above. As you can see, my block is exactly one instruction, which is why I decided to use "Block Throughput". Maybe I'm wrong? Can anyone suggest the right approach?

avt77 updated this revision to Diff 75679.Oct 25 2016, 3:06 AM

I changed everything except the issue with ymm FDIV. As I wrote in my reply to the comment, I'm using the Block Throughput value from the IACA tool. If that's wrong, please tell me and I'll re-collect all the numbers.

RKSimon added inline comments.Oct 25 2016, 4:28 AM
lib/Target/X86/X86TargetTransformInfo.cpp
1088

Please add Nehalem costs (from Agner) - they're notably better than the P4 default:

FSQRT f32/4f32 : 18 f64/2f64 : 32
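A sketch of how those numbers could land in a Nehalem-level (SSE4.2) table; the table name and the guard are assumptions, the costs are the ones Simon quotes above:

static const CostTblEntry SSE42CostTbl[] = {
  { ISD::FSQRT, MVT::f32,   18 }, // Nehalem from http://www.agner.org/
  { ISD::FSQRT, MVT::v4f32, 18 }, // Nehalem from http://www.agner.org/
  { ISD::FSQRT, MVT::f64,   32 }, // Nehalem from http://www.agner.org/
  { ISD::FSQRT, MVT::v2f64, 32 }, // Nehalem from http://www.agner.org/
};
// Looked up under if (ST->hasSSE42()) before falling through to the SSE2/SSE1 tables.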

RKSimon edited edge metadata.Oct 25 2016, 8:03 AM

I'm swaying toward using Agner's numbers for SandyBridge FDIV / FSQRT costs - would anyone have any objections?

mkuper edited edge metadata.Oct 25 2016, 10:27 AM

I'm swaying toward using Agner's numbers for SandyBridge FDIV / FSQRT costs - would anyone have any objections?

I'd really like to understand the discrepancy between Agner's numbers and the IACA numbers - especially since IACA does give the expected number for the latency.
Andrew, any chance you could ping the IACA people at Intel and ask them about this?

Meanwhile, I think I'd also prefer this to go in with Agner's numbers - and change it to the IACA numbers later if it turns out it's accounting for something real.

I'd really like to understand the discrepancy between Agner's numbers and the IACA numbers - especially since IACA does give the expected number for the latency.
Andrew, any chance you could ping the IACA people at Intel and ask them about this?

IACA doesn't have that great support these days - the website says development has been suspended, although they have announced they will do a BDW/SKL bugfix release at some point. Similar comments about dodgy costs don't seem to be answered.

I'd really like to understand the discrepancy between Agner's numbers and the IACA numbers - especially since IACA does give the expected number for the latency.
Andrew, any chance you could ping the IACA people at Intel and ask them about this?

IACA doesn't have that great support these days - the website says development has been suspended, although they have announced they will do a BDW/SKL bugfix release at some point. Similar comments about dodgy costs don't seem to be answered.

To the best of my knowledge, Andrew works for Intel, so he may have a better chance of getting an answer. :-)
As to support - the website has a comment from July that states that "[they] are resuming support for Intel(R) Architecture Code Analyzer with BDW and SKL support probably before end of 2016".

So, one can hope...

If I understood correctly, I should replace all IACA numbers with Agner's numbers, right? OK, I'll do it.
JFYI, I haven't been working at Intel since July, but of course I know a lot of people at Intel and I'll try to ask them about IACA's future.

avt77 added inline comments.Oct 27 2016, 6:35 AM
lib/Target/X86/X86TargetTransformInfo.cpp
269

Maybe I've understood the difference between Agner's and IACA's numbers (for Sandy Bridge at least): Agner's table has numbers for ymm only:
VDIVPS y,y,y 3 3 2 1 21-29 20-28 AVX
VDIVPS y,y,m256 4 3 2 1 1+ 20-28 AVX

but when I played with IACA it showed different operand types: xmm0, ... and ymm0, ....

269

And another comment: for f32/v4f32, IACA shows 14 for vdivss/vdivps, but Agner's table shows the same 10-14 for divss/divps and 20-28 for vdivss/vdivps. Clang generates vdivss/vdivps (for the SNB target), which is why I should select 28 as the cost value if we're using Agner's numbers. Is that OK?

269

As an intermediate decision I did the following for SNB numbers: xmm operands use the first number as a cost while ymm operands use the second number. For example:

{ ISD::FDIV, MVT::f32,   20 }, // SNB from http://www.agner.org/ (IACA: 14)
{ ISD::FDIV, MVT::v4f32, 20 }, // SNB from http://www.agner.org/ (IACA: 14)
{ ISD::FDIV, MVT::v8f32, 28 }, // SNB from http://www.agner.org/ (IACA: 41)

Is that acceptable? In fact, we need some extension of our current cost tables here: it is not enough to know the instruction itself and the type of its operands. We need some target-dependent info as well. For example, if we're speaking about X86 targets only (exactly our case), we could add the size of the target registers or something similar, e.g. the ISA: SNB/HSW can use both divps and vdivps, and using two divps could be better than using one vdivps, but our current cost model cannot help decide which is better.

269

BTW, I've just realized that the Haswell IACA numbers are really close to the expected ones:

f32 13 vdivss
v4f32 13 vdivps xmm0...
v8f32 26 vdivps ymm0...
f64 20 vdivsd
v2f64 20 vdivpd
v4f64 47 vdivpd ymm0...

It means we only have problems with the IACA numbers for SNB (too big a difference between xmm and ymm). Maybe newer CPUs have simply resolved this issue?

1088

JFYI, I got the same numbers for Nehalem with IACA

avt77 updated this revision to Diff 76016.Oct 27 2016, 6:37 AM
avt77 edited edge metadata.

All numbers from IACA were replaced with Agner's numbers

avt77 updated this revision to Diff 76035.Oct 27 2016, 8:10 AM

The wrong SNB numbers were fixed (thanks to Simon Pilgrim).

I think this is almost done now - please can you add the AVX2/Haswell costs:

Haswell
FDIV f32/4f32 = 7, 8f32 = 14, f64/2f64 = 14, 4f64 = 28
FSQRT f32/4f32 = 7, 8f32 = 14, f64/2f64 = 14, 4f64 = 28
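A sketch of how those Haswell costs could be encoded in an AVX2-level table; the table name and guard are assumptions, the values are the ones listed above:

static const CostTblEntry AVX2CostTbl[] = {
  { ISD::FDIV,  MVT::f32,    7 }, // Haswell, per the costs listed above
  { ISD::FDIV,  MVT::v4f32,  7 },
  { ISD::FDIV,  MVT::v8f32, 14 },
  { ISD::FDIV,  MVT::f64,   14 },
  { ISD::FDIV,  MVT::v2f64, 14 },
  { ISD::FDIV,  MVT::v4f64, 28 },
  { ISD::FSQRT, MVT::f32,    7 },
  { ISD::FSQRT, MVT::v4f32,  7 },
  { ISD::FSQRT, MVT::v8f32, 14 },
  { ISD::FSQRT, MVT::f64,   14 },
  { ISD::FSQRT, MVT::v2f64, 14 },
  { ISD::FSQRT, MVT::v4f64, 28 },
};
// Queried under if (ST->hasAVX2()) ahead of the AVX/SSE tables.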

Other than that does anyone have any additional feedback?

I think this is almost done now - please can you add the AVX2/Haswell costs:

Haswell
FDIV f32/4f32 = 7, 8f32 = 14, f64/2f64 = 14, 4f64 = 28
FSQRT f32/4f32 = 7, 8f32 = 14, f64/2f64 = 14, 4f64 = 28

Other than that does anyone have any additional feedback?

No, LGTM (modulo what Simon said about Haswell)

Thanks for fixing this, Andrew.

avt77 updated this revision to Diff 76295.Oct 29 2016, 3:40 AM

Haswell numbers added for AVX2

RKSimon accepted this revision.Oct 29 2016, 4:36 AM
RKSimon edited edge metadata.

LGTM with one minor

lib/Target/X86/X86TargetTransformInfo.cpp
1127

Better to use the Pentium III costs: f32 = 28, v4f32 = 56
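A sketch of the suggested SSE1 entries; the table name is an assumption, the values are from the comment above:

static const CostTblEntry SSE1CostTbl[] = {
  { ISD::FSQRT, MVT::f32,   28 }, // Pentium III costs
  { ISD::FSQRT, MVT::v4f32, 56 }, // Pentium III costs
};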

This revision is now accepted and ready to land.Oct 29 2016, 4:36 AM
avt77 updated this revision to Diff 76366.Oct 31 2016, 2:35 AM
avt77 edited edge metadata.

FSQRT changes: the SSE1 cost table was updated with Pentium III numbers; an SSE42 cost table was added with Nehalem numbers.

This revision was automatically updated to reflect the committed changes.