This is an archive of the discontinued LLVM Phabricator instance.

SSE reciprocal square root instruction latencies
ClosedPublic

Authored by RKSimon on Sep 16 2014, 10:46 AM.

Details

Summary

The SSE rsqrt instruction is a fast reciprocal square estimate (typically <5 cycles) but is currently grouped in the same scheduling IIC_SSE_SQRT* class as the accurate (but very slow) SSE sqrt instruction (often >20 cycles). For code which uses rsqrt (possibly with newton-raphson iterations) this poor scheduling is affecting performance.

This patch splits off the rsqrt instruction from the sqrt instruction scheduling classes and creates new IIC_SSE_RSQRT* classes with latency values based on Agner's tables. The latencies/pipelines for supported x86 targets end up being the same as the rcp(ss,ps) instruction but I've kept them separate.

There is a proposal for a fast-math optimization to use rsqrt + nr (http://llvm.org/bugs/show_bug.cgi?id=20900) which would benefit from this as well.

Note - for the Haswell scheduler I've updated the base model but not altered any of the exceptions/overrides.

Diff Detail

Event Timeline

RKSimon updated this revision to Diff 13758.Sep 16 2014, 10:46 AM
RKSimon retitled this revision from to SSE reciprocal square root instruction latencies.
RKSimon updated this object.
RKSimon edited the test plan for this revision. (Show Details)
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: Unknown Object (MLST).

Andy - I understand you're swamped but its been recommended that I add you to look at my rsqrt scheduling patch.

andreadb edited edge metadata.Sep 25 2014, 6:00 AM

Hi Simon,
Sorry for the late reply.

The patch looks good to me.
The changes to instruction latencies and the new instruction itineraries looks ok to me (I can see how latencies are based on Agner's table). However, I think it is better to get the final approval from somebody more familiar with the Intel scheduling models. For example, the change to X86ScheduleAtom.td should probably be reviewed by others.

As a side note: I have run some benchmarks using the compiler with/without your patch. Unfortunately I haven't seen any particular difference in the codegen. It turns out that most of our benchmarks I tried doesn't have good mix of sqrt/rsqrt. Also, as you said, under fastmath we lack of a rule for converting sqrt+div to rsqrt+mul` (PR20900). I am interested to see how this patch will improve things once PR20900 is fixed.

Thanks,
-Andrea

atrick accepted this revision.Sep 25 2014, 10:31 AM
atrick edited edge metadata.

LGTM.

This revision is now accepted and ready to land.Sep 25 2014, 10:31 AM