This is an archive of the discontinued LLVM Phabricator instance.

Improved cost model for FDIV and FSQRT
ClosedPublic

Authored by avt77 on Oct 18 2016, 5:01 AM.

Details

Summary

There is a bug describing the poor cost model for floating-point operations: Bug 29083 - [X86][SSE] Improve costs for floating point operations. This patch is the second in a series of patches dealing with the cost model.
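For reference, the kind of change involved is adding entries to the per-subtarget cost tables in lib/Target/X86/X86TargetTransformInfo.cpp. A minimal sketch of the mechanism, using LLVM's CostTblEntry/CostTableLookup helpers inside X86TTIImpl::getArithmeticInstrCost(); the table name and cost values below are illustrative placeholders, not the numbers from this patch:

static const CostTblEntry ExampleSSE2CostTbl[] = {
  { ISD::FDIV,  MVT::f32,   23 }, // illustrative reciprocal-throughput cost
  { ISD::FDIV,  MVT::v4f32, 39 }, // illustrative
  { ISD::FSQRT, MVT::v2f64, 32 }, // illustrative
};

// Inside X86TTIImpl::getArithmeticInstrCost(), after type legalization (LT):
if (ST->hasSSE2())
  if (const auto *Entry = CostTableLookup(ExampleSSE2CostTbl, ISD, LT.second))
    return LT.first * Entry->Cost;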

Diff Detail

Event Timeline

avt77 updated this revision to Diff 74981.Oct 18 2016, 5:01 AM
avt77 retitled this revision from to Improved cost model for FDIV and FSQRT.
avt77 updated this object.
avt77 added reviewers: RKSimon, spatel, ABataev.
avt77 updated this revision to Diff 75129.Oct 19 2016, 4:49 AM

I updated the cost numbers according to Simon's requirements.

avt77 updated this revision to Diff 75284.Edited Oct 20 2016, 4:35 AM

Cost model numbers related to Pentium and Nehalem were updated.

RKSimon added inline comments.Oct 20 2016, 10:19 AM
lib/Target/X86/X86TargetTransformInfo.cpp
273

AVXCustomCostTable was recently added - you can probably merge these into that table and avoid the extra lookup.
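For illustration, the suggestion amounts to something like the following sketch (the existing entries and the costs shown are placeholders): fold the new FDIV entries into the existing AVX custom table so a single CostTableLookup covers both and the extra lookup goes away.

static const CostTblEntry AVXCustomCostTable[] = {
  // ... existing custom AVX entries ...
  { ISD::FDIV, MVT::f32,   14 }, // new FDIV entries folded in (illustrative costs)
  { ISD::FDIV, MVT::v8f32, 28 }, // illustrative
};

// One lookup now covers both the old and the new entries:
if (ST->hasAVX())
  if (const auto *Entry = CostTableLookup(AVXCustomCostTable, ISD, LT.second))
    return LT.first * Entry->Cost;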

419

Put these above the SDIV/UDIV costs so it doesn't look like they are under the same 'avoid vectorize division' comment, which is purely an integer division thing.

474

if (ST->hasSSE1())

1087

Move SSE1FloatCostTable below SSE2CostTbl to keep to the 'descending subtarget' convention. You can probably rename it SSE1CostTbl as well.
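For context, the 'descending subtarget' convention means querying the newest feature set first and falling through to older ones, so a newer machine never picks up the older, more pessimistic numbers. A rough sketch (table names are placeholders):

if (ST->hasAVX())
  if (const auto *Entry = CostTableLookup(AVXCostTbl, ISD, LT.second))
    return LT.first * Entry->Cost;
if (ST->hasSSE42())
  if (const auto *Entry = CostTableLookup(SSE42CostTbl, ISD, LT.second))
    return LT.first * Entry->Cost;
if (ST->hasSSE2())
  if (const auto *Entry = CostTableLookup(SSE2CostTbl, ISD, LT.second))
    return LT.first * Entry->Cost;
if (ST->hasSSE1())
  if (const auto *Entry = CostTableLookup(SSE1CostTbl, ISD, LT.second))
    return LT.first * Entry->Cost;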

1088

Worth adding an SSE41CostTbl for Core2-era costs?

1175

if (ST->hasSSE1())

RKSimon added a subscriber: llvm-commits.
mkuper added inline comments.Oct 20 2016, 11:00 AM
lib/Target/X86/X86TargetTransformInfo.cpp
269

A YMM fdiv being 3 times as expensive as an XMM fdiv seems slightly odd.

I'd expect "2, maybe a bit more", and Agner seems to agree - e.g. for Sandy Bridge he gives a range of 10-14 cycles for XMM DIVPS, and 20-28 cycles for YMM VDIVPS. Were your IACA checks accounting for additional instructions? Or is this an inconsistency between IACA and Agner's tests?

(Note that these numbers are supposed to represent reciprocal throughput, but Agner's data for latency also has a factor of ~2.)

avt77 added inline comments.Oct 25 2016, 1:12 AM
lib/Target/X86/X86TargetTransformInfo.cpp
269

I use the following numbers:
atischenko@ip-172-31-21-62:~/iaca-lin64/bin$ ./test-arith-fdiv.sh
*********SandyBridge*********
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - arith-fdiv.o
Binary Format - 64Bit
Architecture - SNB
Analysis Type - Throughput

Throughput Analysis Report

Block Throughput: 14.00 Cycles Throughput Bottleneck: Divider

Port Binding In Cycles Per Iteration:

| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |

| Cycles | 1.0 14.0 | 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 0.0 |

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred

* - instruction micro-ops not bound to a port

^ - Micro Fusion happened

# - ESP Tracking sync uop was issued

@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |           Ports pressure in cycles            |    |
|  Uops  | 0 - DV    | 1   | 2 - D   | 3 - D   | 4   | 5  |    |
|   1    | 1.0  14.0 |     |         |         |     |    | CP | vdivps xmm0, xmm0, xmm0

Total Num Of Uops: 1

atischenko@ip-172-31-21-62:~/iaca-lin64/bin$ ./test-arith-fdiv.sh
*********SandyBridge*********
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - arith-fdiv.o
Binary Format - 64Bit
Architecture - SNB
Analysis Type - Throughput

Throughput Analysis Report

Block Throughput: 41.00 Cycles Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:

| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |

| Cycles | 2.0 28.0 | 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 1.0 |

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred

* - instruction micro-ops not bound to a port

^ - Micro Fusion happened

# - ESP Tracking sync uop was issued

@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |           Ports pressure in cycles            |     |
|  Uops  | 0 - DV    | 1   | 2 - D   | 3 - D   | 4   | 5   |    |
|   3    | 2.0  28.0 |     |         |         |     | 1.0 | CP | vdivps ymm0, ymm0, ymm0

Total Num Of Uops: 3

If we use the DV value then the ratio is about 2x, but I used the "Block Throughput" value above. As you can see, my block is exactly one instruction, which is why I decided to use "Block Throughput". Maybe I'm wrong? Can anyone suggest the right approach?

avt77 updated this revision to Diff 75679.Oct 25 2016, 3:06 AM

I changed everything except the issue with ymm FDIV. As I wrote in my reply to the comment, I'm using the Block Throughput value from the IACA tool. If that's wrong, please tell me and I'll re-collect all the numbers.

RKSimon added inline comments.Oct 25 2016, 4:28 AM
lib/Target/X86/X86TargetTransformInfo.cpp
1088

Please add Nehalem costs (from Agner) - they're notably better than the P4 default:

FSQRT f32/4f32 : 18 f64/2f64 : 32
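A sketch of how those numbers could land in a Nehalem-level (SSE4.2) table; the table name and the guard are assumptions, the costs are the ones Simon quotes above:

static const CostTblEntry SSE42CostTbl[] = {
  { ISD::FSQRT, MVT::f32,   18 }, // Nehalem from http://www.agner.org/
  { ISD::FSQRT, MVT::v4f32, 18 }, // Nehalem from http://www.agner.org/
  { ISD::FSQRT, MVT::f64,   32 }, // Nehalem from http://www.agner.org/
  { ISD::FSQRT, MVT::v2f64, 32 }, // Nehalem from http://www.agner.org/
};
// Looked up under if (ST->hasSSE42()) before falling through to the SSE2/SSE1 tables.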

RKSimon edited edge metadata.Oct 25 2016, 8:03 AM

I'm swaying toward using Agner's numbers for SandyBridge FDIV / FSQRT costs - would anyone have any objections?

mkuper edited edge metadata.Oct 25 2016, 10:27 AM

I'm swaying toward using Agner's numbers for SandyBridge FDIV / FSQRT costs - would anyone have any objections?

I'd really like to understand the discrepancy between Agner's numbers and the IACA numbers - especially since IACA does give the expected number for the latency.
Andrew, any chance you could ping the IACA people at Intel and ask them about this?

Meanwhile, I think I'd also prefer this to go in with Agner's numbers - and change it to the IACA numbers later if it turns out it's accounting for something real.

I'd really like to understand the discrepancy between Agner's numbers and the IACA numbers - especially since IACA does give the expected number for the latency.
Andrew, any chance you could ping the IACA people at Intel and ask them about this?

IACA doesn't have that great support these days - the website says development has been suspended, although they have announced they will do a BDW/SKL bugfix release at some point. Similar comments about dodgy costs don't seem to be answered.

I'd really like to understand the discrepancy between Agner's numbers and the IACA numbers - especially since IACA does give the expected number for the latency.
Andrew, any chance you could ping the IACA people at Intel and ask them about this?

IACA doesn't have that great support these days - the website says development has been suspended, although they have announced they will do a BDW/SKL bugfix release at some point. Similar comments about dodgy costs don't seem to be answered.

To the best of my knowledge, Andrew works for Intel, so he may have a better chance of getting an answer. :-)
As to support - the website has a comment from July that states that "[they] are resuming support for Intel(R) Architecture Code Analyzer with BDW and SKL support probably before end of 2016".

So, one can hope...

If I understood correctly, I should replace all IACA numbers with Agner's numbers, right? OK, I'll do it.
JFYI, I haven't been working at Intel since July, but of course I know a lot of people at Intel and I'll try to ask them about IACA's future.

avt77 added inline comments.Oct 27 2016, 6:35 AM
lib/Target/X86/X86TargetTransformInfo.cpp
269

Maybe I've understood the difference between Agner's and IACA's numbers (for Sandy Bridge at least): Agner's table has numbers for ymm only:
VDIVPS y,y,y 3 3 2 1 21-29 20-28 AVX
VDIVPS y,y,m256 4 3 2 1 1+ 20-28 AVX

but when I played with IACA it showed different operand types: xmm0, ... and ymm0, ....

269

And another comment: for f32/v4f32, IACA shows 14 for vdivss/vdivps, but Agner's table shows the same 10-14 for divss/divps and 20-28 for vdivss/vdivps. Clang generates vdivss/vdivps (for the SNB target), which is why I should select 28 as the cost value if we're using Agner's numbers. Is that OK?

269

As an intermediate decision I did the following for SNB numbers: xmm operands use the first number as a cost while ymm operands use the second number. For example:

{ ISD::FDIV, MVT::f32,   20 }, // SNB from http://www.agner.org/ (IACA: 14)
{ ISD::FDIV, MVT::v4f32, 20 }, // SNB from http://www.agner.org/ (IACA: 14)
{ ISD::FDIV, MVT::v8f32, 28 }, // SNB from http://www.agner.org/ (IACA: 41)

Is that acceptable? In fact, we need some extension of our current cost tables here: it is not enough to know the instruction itself and the type of its operands. We need some target-dependent info as well. For example, if we're speaking about X86 targets only (exactly our case), we could add the size of the target registers or something similar, e.g. the ISA: SNB/HSW can use both divps and vdivps, and using two divps could be better than using one vdivps, but our current cost model cannot help decide which is better.

269

BTW, I've just realized that the Haswell IACA numbers are really close to the expected ones:

f32 13 vdivss
v4f32 13 vdivps xmm0...
v8f32 26 vdivps ymm0...
f64 20 vdivsd
v2f64 20 vdivpd
v4f64 47 vdivpd ymm0...

It means we only have problems with the IACA numbers for SNB (too big a difference between xmm and ymm). Maybe newer CPUs have simply resolved this issue?

1088

JFYI, I got the same numbers for Nehalem with IACA

avt77 updated this revision to Diff 76016.Oct 27 2016, 6:37 AM
avt77 edited edge metadata.

All numbers from IACA were replaced with Agner's numbers

avt77 updated this revision to Diff 76035.Oct 27 2016, 8:10 AM

The wrong SNB numbers were fixed (thanks to Simon Pilgrim).

I think this is almost done now - please can you add the AVX2/Haswell costs:

Haswell
FDIV f32/4f32 = 7, 8f32 = 14, f64/2f64 = 14, 4f64 = 28
FSQRT f32/4f32 = 7, 8f32 = 14, f64/2f64 = 14, 4f64 = 28
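A sketch of how those Haswell costs could be encoded in an AVX2-level table; the table name and guard are assumptions, the values are the ones listed above:

static const CostTblEntry AVX2CostTbl[] = {
  { ISD::FDIV,  MVT::f32,    7 }, // Haswell, per the costs listed above
  { ISD::FDIV,  MVT::v4f32,  7 },
  { ISD::FDIV,  MVT::v8f32, 14 },
  { ISD::FDIV,  MVT::f64,   14 },
  { ISD::FDIV,  MVT::v2f64, 14 },
  { ISD::FDIV,  MVT::v4f64, 28 },
  { ISD::FSQRT, MVT::f32,    7 },
  { ISD::FSQRT, MVT::v4f32,  7 },
  { ISD::FSQRT, MVT::v8f32, 14 },
  { ISD::FSQRT, MVT::f64,   14 },
  { ISD::FSQRT, MVT::v2f64, 14 },
  { ISD::FSQRT, MVT::v4f64, 28 },
};
// Queried under if (ST->hasAVX2()) ahead of the AVX/SSE tables.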

Other than that does anyone have any additional feedback?

I think this is almost done now - please can you add the AVX2/Haswell costs:

Haswell
FDIV f32/4f32 = 7, 8f32 = 14, f64/2f64 = 14, 4f64 = 28
FSQRT f32/4f32 = 7, 8f32 = 14, f64/2f64 = 14, 4f64 = 28

Other than that does anyone have any additional feedback?

No, LGTM (modulo what Simon said about Haswell)

Thanks for fixing this, Andrew.

avt77 updated this revision to Diff 76295.Oct 29 2016, 3:40 AM

Haswell numbers added for AVX2

RKSimon accepted this revision.Oct 29 2016, 4:36 AM
RKSimon edited edge metadata.

LGTM with one minor

lib/Target/X86/X86TargetTransformInfo.cpp
1127

Better to use the Pentium III costs: f32 = 28, v4f32 = 56
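A sketch of the suggested SSE1 entries; the table name is an assumption, the values are from the comment above:

static const CostTblEntry SSE1CostTbl[] = {
  { ISD::FSQRT, MVT::f32,   28 }, // Pentium III costs
  { ISD::FSQRT, MVT::v4f32, 56 }, // Pentium III costs
};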

This revision is now accepted and ready to land.Oct 29 2016, 4:36 AM
avt77 updated this revision to Diff 76366.Oct 31 2016, 2:35 AM
avt77 edited edge metadata.

FSQRT changes: the SSE1 cost table was updated with Pentium III numbers; an SSE42 cost table was added with Nehalem numbers.

This revision was automatically updated to reflect the committed changes.