This is an archive of the discontinued LLVM Phabricator instance.

[x86] Implement combineRepeatedFPDivisors
Closed, Public

Authored by spatel on Apr 9 2015, 4:36 PM.

Details

Summary

This is a trivial patch, but I want to make sure that I'm not being too aggressive for any existing chips.

I've set the transform bar at 2 divisions because the fastest x86 FP divider circuit that I know of is in SandyBridge / Haswell at 10 cycle latency (best case) relative to a 5 cycle multiplier. So that's the worst case for this transform (no latency win), but multiplies are obviously pipelined while divisions are not, so there's still a big throughput win which we would expect to show up in typical FP code.

These are the sequences I'm comparing:

divss   %xmm2, %xmm0
mulss   %xmm1, %xmm0
divss   %xmm2, %xmm0

Becomes:

movss   LCPI0_0(%rip), %xmm3    ## xmm3 = mem[0],zero,zero,zero
divss   %xmm2, %xmm3
mulss   %xmm3, %xmm0
mulss   %xmm1, %xmm0
mulss   %xmm3, %xmm0
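
For readers who prefer source code to assembly, here is a rough C sketch of the kind of expression that could produce the two sequences above. The function names and the exact expression shape are assumptions made for illustration; they are not taken from the patch or its tests. The second function writes out by hand what the transform does: one division to form the reciprocal, then multiplies only.

```c
/* Hypothetical source pattern: two divisions by the same divisor c,
 * with a multiply in between (divss, mulss, divss). */
float repeated_div(float a, float b, float c) {
    return ((a / c) * b) / c;
}

/* Hand-written equivalent of the transformed sequence (illustrative only):
 * a single division computes the reciprocal, which is then reused. */
float repeated_div_recip(float a, float b, float c) {
    float r = 1.0f / c;        /* movss 1.0 + divss */
    return ((a * r) * b) * r;  /* mulss, mulss, mulss */
}
```

Multiplying by a rounded reciprocal is not bit-identical to dividing in general, so a compiler can only make this substitution itself under relaxed floating-point semantics (for example, `-ffast-math` in clang). When the divisor is a power of two the two functions agree exactly; otherwise they may differ in the last few bits.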

[Ignore for the moment that we don't optimize the chain of 3 multiplies into 2 independent fmuls followed by 1 dependent fmul. This is the DAG version of https://llvm.org/bugs/show_bug.cgi?id=21768; if we fix that, then the transform becomes even more profitable on all targets.]
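
The "no latency win" claim can be checked with simple arithmetic on the critical path. The sketch below is a back-of-the-envelope model, not a measurement: it assumes the best-case latencies quoted in the summary (10 cycles per divide, 5 per multiply), ignores the constant load, and treats each instruction as starting as soon as its inputs are ready. The reassociated variant is one reading of the "2 independent fmuls followed by 1 dependent fmul" remark, namely (a*r)*(b*r); all function names are hypothetical.

```c
/* Assumed best-case latencies from the summary (SandyBridge / Haswell). */
enum { DIV_LATENCY = 10, MUL_LATENCY = 5 };

/* Original: divss -> mulss -> divss, each waiting on the previous result. */
int original_chain_latency(void) {
    return DIV_LATENCY + MUL_LATENCY + DIV_LATENCY;
}

/* Transformed: one divss for the reciprocal, then three dependent mulss. */
int transformed_chain_latency(void) {
    return DIV_LATENCY + 3 * MUL_LATENCY;
}

/* Hypothetical reassociated form (a*r)*(b*r): the two inner multiplies
 * both wait only on the reciprocal and can overlap, then one final multiply. */
int reassociated_chain_latency(void) {
    int inner_ready = DIV_LATENCY + MUL_LATENCY; /* a*r and b*r in parallel */
    return inner_ready + MUL_LATENCY;
}
```

Under these assumptions the original and transformed chains both take 25 cycles, which matches the worst case described in the summary, while the reassociated form would take 20. The throughput benefit of replacing an unpipelined divide with pipelined multiplies is not captured by this latency-only model.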

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 23542. Apr 9 2015, 4:36 PM
spatel retitled this revision to [x86] Implement combineRepeatedFPDivisors.
spatel updated this object.
spatel edited the test plan for this revision.
spatel added a subscriber: Unknown Object (MLST).
qcolombet accepted this revision. Apr 13 2015, 11:51 AM
qcolombet edited edge metadata.

Hi Sanjay,

Looks good to me.

Thanks,
-Quentin

This revision is now accepted and ready to land. Apr 13 2015, 11:51 AM
This revision was automatically updated to reflect the committed changes.

Thanks, Quentin - checked in at r235012.