There is a bug describing the poor cost model for floating point operations: Bug 29083 - [X86][SSE] Improve costs for floating point operations. This patch is the second in a series of patches dealing with the cost model.
Event Timeline
lib/Target/X86/X86TargetTransformInfo.cpp

Line 273: AVXCustomCostTable was recently added - you can probably merge these into that table and avoid the extra lookup.

Line 406: Put these above the SDIV/UDIV costs so it doesn't look like they are under the same 'avoid vectorize division' comment, which is purely an integer division thing.

Line 459: if (ST->hasSSE())

Line 1072: Move SSE1FloatCostTable below SSE2CostTbl to keep to the 'descending subtarget' convention. You can probably rename it SSE1CostTbl as well.

Line 1073: Worth adding a SSE41CostTbl for Core2 era costs?

Line 1160: if (ST->hasSSE())
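For context, the inline comments above refer to the per-subtarget cost tables in X86TargetTransformInfo.cpp. Below is a minimal sketch of the pattern being discussed; the entry values are placeholders for illustration, not the patch's actual numbers, and the surrounding variables (ISD, LT, ST) come from the enclosing getArithmeticInstrCost code.

```cpp
// Schematic fragment from inside X86TTIImpl::getArithmeticInstrCost:
//   ISD - the ISD opcode being costed (e.g. ISD::FDIV)
//   LT  - result of getTypeLegalizationCost(Ty): {split factor, legal MVT}
//   ST  - the X86 subtarget
// Costs below are placeholders, for illustration only.
static const CostTblEntry SSE2CostTbl[] = {
  { ISD::FDIV, MVT::f32,   23 }, // divss (placeholder cost)
  { ISD::FDIV, MVT::v4f32, 39 }, // divps (placeholder cost)
};

// Tables are checked in descending subtarget order (AVX2, AVX, SSE42,
// SSE2, SSE1, ...) - the 'descending subtarget' convention mentioned
// above - and the first matching entry wins.
if (ST->hasSSE2())
  if (const auto *Entry = CostTableLookup(SSE2CostTbl, ISD, LT.second))
    return LT.first * Entry->Cost;
```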
lib/Target/X86/X86TargetTransformInfo.cpp

Line 269: A YMM fdiv being 3 times as expensive as an XMM fdiv seems slightly odd. I'd expect "2, maybe a bit more", and Agner seems to agree - e.g. for Sandybridge he gives a range of 10-14 cycles for XMM DIVPS, and 20-28 cycles for YMM VDIVPS. Were your IACA checks accounting for additional instructions? Or is this an inconsistency between IACA and Agner's tests? (Note that these numbers are supposed to represent reciprocal throughput, but Agner's data for latency also has a factor of ~2.)
lib/Target/X86/X86TargetTransformInfo.cpp

Line 269: I used the following numbers. For the xmm case:

    Throughput Analysis Report
    Block Throughput: 14.00 Cycles    Throughput Bottleneck: Divider

    Port Binding In Cycles Per Iteration:
    | Port   | 0 - DV   | 1   | 2 - D   | 3 - D   | 4   | 5   |
    | Cycles | 1.0 14.0 | 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 0.0 |

    N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
    ^ - Micro Fusion happened
    # - ESP Tracking sync uop was issued
    @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected

    Total Num Of Uops: 1

And for the ymm case:

    Throughput Analysis Report
    Block Throughput: 41.00 Cycles    Throughput Bottleneck: InterIteration

    Port Binding In Cycles Per Iteration:
    | Port   | 0 - DV   | 1   | 2 - D   | 3 - D   | 4   | 5   |
    | Cycles | 2.0 28.0 | 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 1.0 |

    N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
    ^ - Micro Fusion happened
    # - ESP Tracking sync uop was issued
    @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected

    Total Num Of Uops: 3

If we use the DV value then it's about 2 times. But I used the "Block Throughput" value above. As you see, my block is exactly one instruction, which is why I decided to use "Block Throughput". Maybe I'm wrong? Can anyone suggest the right decision?
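For reference, here is a minimal sketch of how a one-instruction block like the one above can be isolated for IACA, assuming Intel's iacaMarks.h header (which provides the IACA_START / IACA_END markers) is available; the function names and the use of intrinsics are just for illustration, and the compiler has to actually leave a single (v)divps between the markers for the reported numbers to match.

```cpp
#include <immintrin.h>
#include "iacaMarks.h" // Intel's IACA marker macros: IACA_START / IACA_END

// Hypothetical helpers: the analyzed block should contain only the divide.
__m128 measure_divps_xmm(__m128 a, __m128 b) {
  IACA_START                      // begin the analyzed block
  __m128 r = _mm_div_ps(a, b);    // compiles to (v)divps xmm
  IACA_END                        // end the analyzed block
  return r;
}

__m256 measure_divps_ymm(__m256 a, __m256 b) {
  IACA_START
  __m256 r = _mm256_div_ps(a, b); // compiles to vdivps ymm
  IACA_END
  return r;
}
```

IACA then reports Block Throughput for whatever ends up between the markers, which is why any extra instructions landing in the block would inflate the number.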
I changed everything except the issue with the ymm FDIV. As I wrote in my reply to the comment, I'm using the Block Throughput value from the IACA tool. If I'm wrong, please tell me and I'll recollect all the numbers.
I'm leaning toward using Agner's numbers for the SandyBridge FDIV / FSQRT costs - would anyone have any objections?
I'd really like to understand the discrepancy between Agner's numbers and the IACA numbers - especially since IACA does give the expected number for the latency.
Andrew, any chance you could ping the IACA people at Intel and ask them about this?
Meanwhile, I think I'd also prefer this to go in with Agner's numbers - and change it to the IACA numbers later if it turns out it's accounting for something real.
IACA doesn't have that great support these days - the website says development has been suspended, although they have announced they will do a BDW/SKL bugfix release at some point. Similar comments about dodgy costs don't seem to be answered.
To the best of my knowledge, Andrew works for Intel, so he may have a better chance of getting an answer. :-)
As to support - the website has a comment from July that states that "[they] are resuming support for Intel(R) Architecture Code Analyzer with BDW and SKL support probably before end of 2016".
So, one can hope...
If I understood correctly, I should replace all IACA numbers with Agner's numbers, right? OK, I'll do it.
JFYI, I haven't been working at Intel since July, but of course I know a lot of people there and I'll try to ask them about IACA's future.
lib/Target/X86/X86TargetTransformInfo.cpp

Line 269: Maybe I've understood the difference between Agner's and IACA's numbers (for SandyBridge at least): Agner's table has numbers for ymm only, but when I played with IACA it showed different operand types: xmm0, ... and ymm0, ....

Line 269: And another comment: for f32/v4f32 IACA shows 14 for vdivss/vdivps, but Agner's table shows the same 10-14 for divss/divps and 20-28 for vdivss/vdivps. Clang generates vdivss/vdivps (for the SNB target), which is why I should select 28 as the cost value if we're using Agner's numbers. Is that OK?

Line 269: As an intermediate decision I did the following for the SNB numbers: xmm operands use the first number as the cost, while ymm operands use the second number. For example:

    { ISD::FDIV, MVT::f32,   20 }, // SNB from http://www.agner.org/ (IACA: 14)
    { ISD::FDIV, MVT::v4f32, 20 }, // SNB from http://www.agner.org/ (IACA: 14)
    { ISD::FDIV, MVT::v8f32, 28 }, // SNB from http://www.agner.org/ (IACA: 41)

Is it acceptable? In fact we need some extension of our current cost tables here: it is not enough to know the instruction itself and the type of its operands. We need some target-dependent info as well. E.g. if we're speaking about X86 targets only (exactly our case), we could add the size of the target registers or something similar, e.g. the ISA: SNB/HSW can use both divps and vdivps, and using 2 divps could be better than using one vdivps, but our current cost model cannot help decide which is better.

Line 269: BTW, I've just realized that the Haswell IACA numbers are really close to the expected ones, e.g. f32: 13 for vdivss. It means we have problems with the IACA numbers for SNB only (too big a difference between xmm and ymm). Maybe newer CPUs simply resolved this issue?

Line 1073: JFYI, I got the same numbers for Nehalem with IACA.
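To make the "extended cost table" idea above concrete, here is one hypothetical shape such an entry could take. The struct, the VectorWidthBits field, the table name, and the 128-bit cost are invented for illustration (only the 20/28 values come from the comment above); nothing like this exists in the current cost model.

```cpp
// Hypothetical, illustrative only - not existing LLVM code.
// The idea: record which register width / encoding a cost was measured for,
// so the model could compare e.g. one vdivps ymm against two divps xmm.
struct WidthAwareCostTblEntry {
  int ISD;                  // e.g. ISD::FDIV
  MVT Type;                 // e.g. MVT::v8f32
  unsigned VectorWidthBits; // width of the lowering being costed: 128 or 256
  unsigned Cost;
};

static const WidthAwareCostTblEntry SNBFDivCostsSketch[] = {
  { ISD::FDIV, MVT::v8f32, 256, 28 }, // one vdivps ymm (from the comment above)
  { ISD::FDIV, MVT::v8f32, 128, 40 }, // two vdivps xmm (2 x 20, illustrative)
};
```

Whether the 128-bit lowering is ever actually cheaper is exactly the open question raised above; the sketch only shows where such information could live.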
I think this is almost done now - please can you add the AVX2/Haswell costs:
Haswell:
FDIV:  f32/v4f32 = 7, v8f32 = 14, f64/v2f64 = 14, v4f64 = 28
FSQRT: f32/v4f32 = 7, v8f32 = 14, f64/v2f64 = 14, v4f64 = 28
Other than that, does anyone have any additional feedback?
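Translated into cost-table form, the requested Haswell numbers would look roughly like the following. The table name AVX2CostTbl just follows the existing naming convention; treat this as a sketch of the requested entries, not the final patch.

```cpp
static const CostTblEntry AVX2CostTbl[] = {
  { ISD::FDIV,  MVT::f32,    7 }, // Haswell numbers from the request above
  { ISD::FDIV,  MVT::v4f32,  7 },
  { ISD::FDIV,  MVT::v8f32, 14 },
  { ISD::FDIV,  MVT::f64,   14 },
  { ISD::FDIV,  MVT::v2f64, 14 },
  { ISD::FDIV,  MVT::v4f64, 28 },
  { ISD::FSQRT, MVT::f32,    7 },
  { ISD::FSQRT, MVT::v4f32,  7 },
  { ISD::FSQRT, MVT::v8f32, 14 },
  { ISD::FSQRT, MVT::f64,   14 },
  { ISD::FSQRT, MVT::v2f64, 14 },
  { ISD::FSQRT, MVT::v4f64, 28 },
};
```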
FSQRT changes: the SSE1 cost table was updated with Pentium III numbers; an SSE42 cost table was added with Nehalem numbers.