[X86][SSE] Improve DIV/SQRT throughput estimates for SB/HW schedule models
Abandoned, Public

Authored by RKSimon on Apr 19 2017, 4:38 AM.

Details

Summary

The current DIV/SQRT throughput estimates for SB/HW schedule models use the default 1cy value, which is highly unrealistic.

I've updated the values with estimates based on the latencies, which is typically about right for DIV/SQRT units; it's also in the ballpark of what Agner suggests. If anyone has more accurate values that would be great, but these alone should be a major improvement to scheduling.
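To illustrate why the default matters, here is a minimal sketch (not LLVM code) of how long a run of independent divides appears to take on a single non-pipelined unit. The function name and the 12-cycle latency are assumptions chosen for illustration only.

```python
def cycles_for_independent_ops(n_ops: int, latency: int, resource_cycles: int) -> int:
    """Cycle at which the last of n_ops independent ops completes on one unit.

    resource_cycles is how long the unit stays busy per op, i.e. the
    reciprocal throughput the schedule model assumes.
    """
    if n_ops <= 0:
        return 0
    # A new op can only issue once the unit has been released by the previous one.
    last_issue = (n_ops - 1) * resource_cycles
    # The last op still needs its full latency to produce a result.
    return last_issue + latency


LATENCY = 12  # hypothetical divide latency in cycles, for illustration only

print(cycles_for_independent_ops(4, LATENCY, 1))        # default 1cy throughput: 15
print(cycles_for_independent_ops(4, LATENCY, LATENCY))  # throughput ~= latency: 48
```

Under the 1cy default the model believes four divides finish in 15 cycles; with throughput tied to latency the same sequence is modelled at 48 cycles, which is closer to how a blocking divider behaves.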

Diff Detail

Repository
rL LLVM
RKSimon created this revision. Apr 19 2017, 4:38 AM
avt77 added a comment. Apr 21 2017, 2:51 AM

What are your plans here? I've just checked (with help of "-print-schedule=true") IMUL and LEA for Jaguar: they are completely wrong if we compare with numbers from http://www.agner.org/optimize/instruction_tables.pdf. Are we going to change all these things step-by-step?

> What are your plans here? I've just checked (with help of "-print-schedule=true") IMUL and LEA for Jaguar: they are completely wrong if we compare with numbers from http://www.agner.org/optimize/instruction_tables.pdf. Are we going to change all these things step-by-step?

The basic process will be: add thorough tests, identify issues, fix issues (either direct commit or reviewed patch if it warrants discussion). I'm intending to initially focus on the SSE/AVX instructions so if you want to add scheduler tests for the mul/imul/lea/etc. instructions then I say go for it.

gadi.haber added inline comments. Apr 24 2017, 11:30 PM
lib/Target/X86/X86SchedHaswell.td
139

let NumMicroOps = 1;

143

let NumMicroOps = 2;

148

let NumMicroOps = 1;

152

let NumMicroOps = 2;

lib/Target/X86/X86SchedSandyBridge.td
126

let NumMicroOps = 1;

130

let NumMicroOps = 2;

135

let NumMicroOps = 1;

139

let NumMicroOps = 2;

RKSimon updated this revision to Diff 96539. Apr 25 2017, 6:05 AM

Add NumMicroOps and regenerate (adds 256-bit vector cases which were added recently).

gadi.haber added inline comments. Apr 27 2017, 5:08 AM
lib/Target/X86/X86SchedHaswell.td
140

instruction latency of X87 FDIV in Haswell is actually higher and takes 20 cycles

141

I believe ResourceCycles here should be 1.

145

latency of FDIVLd in Haswell is 24

146

ResourceCycles for FDIVLd is [1, 1]

151

latency of FSqrt in Haswell is 23

156

I don't have the exact latency for Haswell, but it is larger than 23

157

ResourceCycles is [1, 1]

1926

HWPort15 should actually be changed to HWPort015 in Haswell

1929

ResourceCycles should be [2, 1]

ResourceCycles lists the number of times each HW port is used by the instruction.
In this case HWPort0 is used twice (by uOp1 and uOp2) and HWPort015 is used only once (by uOp3).

RKSimon added inline comments. Apr 27 2017, 8:55 AM
lib/Target/X86/X86SchedHaswell.td
140

Despite its name, this scheduling class is also used by the SSE/AVX float/double division (just the xmm variants here, as the ymm ones are overridden). Given that we barely use x87 these days, aren't we better off using the value just for SSE/AVX?

145

Then why is load latency in HWWriteResPair just 4 cycles?

151

Please can you cite the source of these numbers? I've been careful not to change the current latency values (as shown in the diffs in the tests below) and am just trying to add more realistic throughput values.

1929

I don't think I agree. ResourceCycles is an analogue for throughput here - the number of cycles that the op consumes this resource for in that stage. It should be 12 (ish) cycles to indicate that HWPort0 won't accept instructions for 12 cycles while it completes the division.
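The two readings of ResourceCycles can be compared with a toy bottleneck model. This is a simplified sketch, not LLVM's implementation: it assumes reciprocal throughput is the largest busy-cycles-per-unit ratio across the resources an instruction uses, that HWPort0 is a single unit, and that HWPort015 is a group of three ports. The 12-cycle figure is the rough value mentioned above; the dictionary names are hypothetical.

```python
def reciprocal_throughput(resource_cycles: dict, units: dict) -> float:
    """Cycles between back-to-back issues, limited by the busiest resource."""
    return max(cycles / units[port] for port, cycles in resource_cycles.items())


# Assumed number of execution units behind each resource name.
UNITS = {"HWPort0": 1, "HWPort015": 3}

# Reading 1: ResourceCycles counts how many uops use each port.
uop_count_view = {"HWPort0": 2, "HWPort015": 1}

# Reading 2: ResourceCycles is how long the port stays blocked by the divide.
busy_cycle_view = {"HWPort0": 12, "HWPort015": 1}

print(reciprocal_throughput(uop_count_view, UNITS))   # 2.0
print(reciprocal_throughput(busy_cycle_view, UNITS))  # 12.0
```

In this model the uop-count reading lets a new division start every 2 cycles, while the busy-cycle reading keeps HWPort0 unavailable for the full 12 cycles.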

RKSimon abandoned this revision. Jun 7 2017, 6:53 AM

D33897 is moving SNB/HW scheduler table to auto-gen