This is a first step toward generating SSE rsqrt instructions for reciprocal square root calculations when fast-math is allowed.
For now, be conservative and only enable this for AMD btver2 where performance improves significantly - for example, 29% on llvm/projects/test-suite/SingleSource/Benchmarks/BenchmarkGame/n-body.c if we convert the data type to single-precision float.
We will probably never enable this codegen for any Intel Core* chips because the sqrt/divider circuits are just too fast. On SandyBridge, sqrtss + divss can take as few as 20 cycles, which is better than the 23-cycle critical path for the rsqrt + mul + mul + add + mul estimate.
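For reference, the refinement part of that estimate sequence can be sketched in scalar C. This is only an illustration of the arithmetic, not the code in the patch: the function name refine_rsqrt is hypothetical, and the initial estimate e is passed in as a parameter where the real sequence would use the hardware rsqrtss result.

```c
/* Sketch only: one Newton-Raphson step that refines an estimate
   e ~= 1/sqrt(a), written as (-0.5 * e) * (a * e * e + -3.0).
   The name refine_rsqrt is hypothetical, not an LLVM identifier. */
static float refine_rsqrt(float a, float e) {
    float ae   = a * e;        /* mul: first op after the rsqrt estimate */
    float aee  = ae * e;       /* mul */
    float sum  = aee + -3.0f;  /* add */
    float half = e * -0.5f;    /* mul: depends only on e, so it is off the critical path */
    return half * sum;         /* mul: last op on the critical path */
}
```

With a = 4 and a deliberately poor estimate e = 0.51, one step moves the result to roughly 0.4997, close to the exact 0.5. The dependent chain is mul, mul, add, mul after the estimate, which matches the sequence quoted above.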
Follow-on patches may allow reciprocal (rcpss) optimizations, add more vector data types, and enable the optimization for more chips.
More background here: http://llvm.org/bugs/show_bug.cgi?id=20900
I'd really prefer that you put the 2-constant version of the algorithm into the DAGCombiner alongside the 1-constant version, and just let the target pick. The algorithm itself is a mathematical expression and not really target-dependent, so we should keep it available to other targets without copy-and-paste.
Ideally, we'd then also have a flag to force one or the other. That way PPC can default to the 1-constant version and X86 can default to the 2-constant version, while a command-line option lets me force the choice for benchmarking.
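To make the two variants concrete, here is a minimal scalar sketch in C of both forms of the Newton-Raphson step, plus a selector standing in for the target default or command-line override. All names here (nr_form, rsqrt_step_one_const, rsqrt_step_two_const, rsqrt_step) are hypothetical and only illustrate the math; they are not DAGCombiner code or existing LLVM options.

```c
/* Hypothetical selector; a target default or a flag would pick one. */
typedef enum { NR_ONE_CONST, NR_TWO_CONST } nr_form;

/* 1-constant form: e * (1.5 - (a/2) * e * e).
   Assumption: a/2 is derived as 1.5*a - a so that 1.5 is the only constant. */
static float rsqrt_step_one_const(float a, float e) {
    float half_a = 1.5f * a - a;
    return e * (1.5f - half_a * e * e);
}

/* 2-constant form: (-0.5 * e) * (a * e * e + -3.0). */
static float rsqrt_step_two_const(float a, float e) {
    return (-0.5f * e) * (a * e * e + -3.0f);
}

/* Dispatch on the chosen form; both are algebraically the same step. */
static float rsqrt_step(nr_form form, float a, float e) {
    return form == NR_ONE_CONST ? rsqrt_step_one_const(a, e)
                                : rsqrt_step_two_const(a, e);
}
```

Both forms compute the same refinement and differ only in how the operations are grouped and which constants they need, which is why the choice can reasonably be left to the target.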