The SSE rsqrt instruction is a fast reciprocal square root estimate (typically <5 cycles), but it is currently grouped in the same IIC_SSE_SQRT* scheduling classes as the accurate but very slow SSE sqrt instruction (often >20 cycles). For code that uses rsqrt (possibly with Newton-Raphson iterations), this poor scheduling hurts performance.
This patch splits the rsqrt instruction out of the sqrt scheduling classes and creates new IIC_SSE_RSQRT* classes with latency values based on Agner Fog's instruction tables. The latencies/pipelines for the supported x86 targets end up being the same as for the rcp(ss,ps) instructions, but I've kept the classes separate.
There is a proposal for a fast-math optimization to use rsqrt + Newton-Raphson refinement (http://llvm.org/bugs/show_bug.cgi?id=20900), which would benefit from this as well.
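For reference, the rsqrt + Newton-Raphson pattern mentioned above looks roughly like the sketch below. This is only an illustration of the kind of user code affected, not the code the proposed optimization would generate; the helper name rsqrt_nr is hypothetical, and a single refinement step is an assumption (some code uses more, or none).

```cpp
#include <xmmintrin.h> // SSE intrinsics: _mm_set_ss, _mm_rsqrt_ss, _mm_cvtss_f32

// Hypothetical helper (not part of this patch): approximate 1/sqrt(x) using
// the rsqrtss hardware estimate plus one Newton-Raphson refinement step.
inline float rsqrt_nr(float x) {
  // Fast hardware estimate of 1/sqrt(x); only roughly 12 bits are accurate.
  float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
  // One Newton-Raphson step on f(y) = 1/(y*y) - x:
  //   y' = y * (1.5 - 0.5 * x * y * y)
  return y * (1.5f - 0.5f * x * y * y);
}
```

The multiplies in the refinement step depend directly on the rsqrt result, so a scheduler that models rsqrt with sqrt-like latency will tend to place that dependent chain much later than necessary.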
Note - for the Haswell scheduler I've updated the base model but not altered any of the exceptions/overrides.