This is a first step toward generating SSE rsqrt instructions for reciprocal square root calculations when fast-math is allowed.

For now, be conservative and only enable this for AMD btver2, where performance improves significantly: for example, by 29% on llvm/projects/test-suite/SingleSource/Benchmarks/BenchmarkGame/n-body.c if we convert the data type to single-precision float.

We will probably never enable this codegen for any Intel Core* chips because the sqrt/divider circuits are just too fast. On SandyBridge, sqrtss + divss can be as fast as 20 cycles, which is better than the 23-cycle critical path for the rsqrt + mul + mul + add + mul estimate.
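
For reference, the rsqrt + mul + mul + add + mul sequence corresponds to a hardware estimate followed by one Newton-Raphson refinement step, x1 = x0 * (1.5 - 0.5 * a * x0 * x0). The sketch below writes that out with SSE intrinsics. It is only an illustration and assumes an x86 target with SSE; the helper name rsqrt_refined is hypothetical, and the exact instruction order the patch emits may differ.

```c
#include <xmmintrin.h> /* SSE intrinsics: _mm_rsqrt_ss, _mm_mul_ss, ... */

/* Hypothetical helper, not taken from the patch.
 * Approximates 1/sqrt(a) as x0 * (1.5 - 0.5 * a * x0 * x0),
 * arranged as (a * x0 * x0 - 3.0) * (x0 * -0.5) so that the
 * dependent chain is rsqrt -> mul -> mul -> add -> mul. */
static float rsqrt_refined(float a)
{
    __m128 va  = _mm_set_ss(a);
    __m128 est = _mm_rsqrt_ss(va);                /* rsqrtss: coarse estimate x0 */
    __m128 t   = _mm_mul_ss(va, est);             /* mul: a * x0                 */
    t = _mm_mul_ss(t, est);                       /* mul: a * x0 * x0            */
    t = _mm_add_ss(t, _mm_set_ss(-3.0f));         /* add: a * x0 * x0 - 3.0      */
    /* This multiply depends only on the estimate, so it can overlap
     * with the chain above and stays off the critical path. */
    __m128 h = _mm_mul_ss(est, _mm_set_ss(-0.5f));
    return _mm_cvtss_f32(_mm_mul_ss(t, h));       /* mul: final refined value    */
}
```

Calling rsqrt_refined(4.0f) should return a value very close to 0.5f. The result is an approximation rather than a correctly rounded 1/sqrt(a), which is why this kind of transform is only valid when fast-math is allowed.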

Follow-on patches may allow reciprocal (rcpss) optimizations, add more vector data types, and enable the optimization for more chips.

More background here: http://llvm.org/bugs/show_bug.cgi?id=20900