This is an archive of the discontinued LLVM Phabricator instance.

Use rcpss/rcpps (X86) to speed up reciprocal calcs (PR21385)
ClosedPublic

Authored by spatel on Nov 7 2014, 12:06 PM.

Details

Summary

This is a first step for generating SSE rcp instructions for reciprocal calcs when fast-math allows it. This is very similar to the rsqrt optimization enabled in D5658 ( http://reviews.llvm.org/rL220570 ).

For now, be conservative and only enable this for AMD btver2, where performance improves significantly in both latency and throughput.

We may never enable this codegen for Intel Core* chips because their divider circuits are just too fast: on SandyBridge, divss can be as fast as 10 cycles, versus the 21-cycle critical path for the rcp + mul + sub + mul + add estimate.
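
For readers unfamiliar with the rcp + mul + sub + mul + add sequence, here is a rough C sketch of one Newton-Raphson refinement step for a reciprocal. It is only an illustration, not the code from the patch: the function names refine_reciprocal and reciprocal_rel_error are made up for this example, and the initial estimate is passed in by the caller instead of coming from rcpss, so the sketch does not depend on SSE intrinsics. If the estimate is x0 = (1 - e)/a, one step yields (1 - e*e)/a, so the relative error is roughly squared.

```c
#include <math.h>

/* One Newton-Raphson refinement step for 1/a, written in the same
   mul + sub + mul + add order as the estimate sequence described above.
   x0 is an initial estimate of 1/a. In the generated code it would come
   from rcpss/rcpps; here the caller supplies it (an assumption made to
   keep this sketch portable). */
float refine_reciprocal(float a, float x0)
{
    float ax  = a * x0;      /* mul: close to 1.0 when x0 is a good estimate */
    float err = 1.0f - ax;   /* sub: remaining relative error */
    float fix = x0 * err;    /* mul: correction term */
    return x0 + fix;         /* add: refined estimate */
}

/* Hypothetical helper: relative error of estimate x against the exact 1/a. */
float reciprocal_rel_error(float a, float x)
{
    return fabsf(a * x - 1.0f);
}
```

For example, refine_reciprocal(4.0f, 0.2499f) returns a value much closer to 0.25f than the starting estimate. The four dependent floating-point operations after the rcp are what make up the long critical path mentioned above.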

Follow-on patches may allow configuration of the number of Newton-Raphson refinement steps, add AVX512 support, and enable the optimization for more chips.

More background here: http://llvm.org/bugs/show_bug.cgi?id=21385

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 15935. Nov 7 2014, 12:06 PM
spatel retitled this revision to Use rcpss/rcpps (X86) to speed up reciprocal calcs (PR21385).
spatel updated this object.
spatel edited the test plan for this revision.
spatel added reviewers: hfinkel, andreadb, nadav.
spatel added a subscriber: Unknown Object (MLST).
hfinkel accepted this revision. Nov 11 2014, 12:11 AM
hfinkel edited edge metadata.

This is very similar to the rsqrt optimization enabled in D5658 ( http://reviews.llvm.org/rL220570 ).

Yes, indeed it seems that way (when you commit, make sure you mention the commit revision corresponding to D5658 in the commit message).

LGTM.

This revision is now accepted and ready to land. Nov 11 2014, 12:11 AM
spatel closed this revision. Nov 11 2014, 12:42 PM
spatel updated this revision to Diff 16056.

Closed by commit rL221706 (authored by @spatel).

Thanks, Hal. Committed with r221706.