This patch simplifies "X/sqrt(X)" to "sqrt(X)". For "X/sqrt(X)", LLVM generates two operations, "sqrt" and "div"; simplifying it leaves a single "sqrt" operation. The simplification is enabled only when reassociation is allowed.
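For illustration, a minimal IR sketch of the pattern (the `fast` flag here stands in for whatever flags the patch actually requires):

--Snip--
; before: two instructions, and %x has two uses
define double @foo(double %x) {
  %s = call fast double @llvm.sqrt.f64(double %x)
  %d = fdiv fast double %x, %s
  ret double %d
}

; after: a single sqrt, and %x has one use
define double @foo(double %x) {
  %s = call fast double @llvm.sqrt.f64(double %x)
  ret double %s
}

declare double @llvm.sqrt.f64(double)
--Snip--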
Just a drive-by comment: possibly the more general form of this needs to be optimised too?
The more general form of this is:
X^a * X^b -> X^(a+b)
X^a / X^b -> X^(a-b)
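In IR that would presumably look something like the following with the pow intrinsic (a sketch only; whether pow is the right form to match is an open question):

--Snip--
define double @powmul(double %x, double %a, double %b) {
  %p0 = call fast double @llvm.pow.f64(double %x, double %a)
  %p1 = call fast double @llvm.pow.f64(double %x, double %b)
  %m  = fmul fast double %p0, %p1
  ret double %m
}

; could fold to:
define double @powmul(double %x, double %a, double %b) {
  %e = fadd fast double %a, %b
  %p = call fast double @llvm.pow.f64(double %x, double %e)
  ret double %p
}

declare double @llvm.pow.f64(double, double)
--Snip--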
I'm fairly sure this transform is a performance loss. For a target like Skylake Server, a SQRT(x) can take up to 20 cycles, but an RSQRT(x) is about 6 cycles and a MUL(y) is 4 cycles. We'd be better off with X*RSQRT(X).
Do you expect the more general form to use the pow intrinsic? It would be good to file a bugzilla with examples of how this looks in source/IR, so we know that we are matching the expected patterns. I think that will have to be handled independently from the sqrt special-case.
That is up to the backends to decide. InstSimplify/InstCombine (and a few others) are target-independent canonicalization passes.
A single sqrt(x) is more canonical IR than x/sqrt(x), because it is fewer instructions and x has fewer uses.
I agree with that. It should be canonicalized. It would also be good to make sure that the backends have lowering code in place before introducing a 2x performance hit.
As of now, for -march=skylake -Ofast I get:
--Snip--
foo:                                    # @foo
        vsqrtsd xmm1, xmm0, xmm0
        vdivsd  xmm0, xmm0, xmm1
        ret
--Snip--
The backend can lower SQRT(X) back to X * RSQRT(X) in a separate patch?
I agree with both.
Knowing that this is often a perf-critical pattern, we've put a lot of SDAG effort into optimizing it for x86 already, but it's a tricky problem because we get into a situation similar to the one we have with FMA. I.e., should we favor latency, throughput, or some combination?
Here's an example to show that we are trying to make the right decision per-CPU in codegen:
https://godbolt.org/z/s5GPqc
Note that the codegen choice can differ between scalar/vector - example in the X86.td file:
// FeatureFastScalarFSQRT should be enabled if scalar FSQRT has shorter latency
// than the corresponding NR code. FeatureFastVectorFSQRT should be enabled if
// vector FSQRT has higher throughput than the corresponding NR code.
// The idea is that throughput bound code is likely to be vectorized, so for
// vectorized code we should care about the throughput of SQRT operations.
// But if the code is scalar that probably means that the code has some kind of
// dependency and we should care more about reducing the latency.
Agreed "FeatureFastScalarFSQRT" can be removed if target thinks scalar FSQRT is costly. I see currently set at "SKXTuning" (Skylake).
After looking at the codegen, I'm not sure we can do this transform in IR and still get the expected performance from codegen, because the transform loses information:
https://godbolt.org/z/7b84rG
The codegen for the case of "sqrt(x)" has to account for a 0.0 input. I.e., we filter out a 0.0 (or potentially denormal) input to avoid the NaN answer that we would get from "0.0 / 0.0". But the codegen for the case of "x/sqrt(x)" does not have to do that - NaN is the correct answer for a 0.0 input, so the code has implicitly signaled to us that 0.0 is not a valid input when compiled with -ffast-math (we can ignore possible NaNs).
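To spell out the estimate sequence (a sketch, assuming a reciprocal-sqrt estimate refined by one Newton-Raphson step):

\[
\sqrt{x} \approx x \cdot y_1, \qquad y_1 = \tfrac{1}{2}\, y_0 \,(3 - x\, y_0^2), \qquad y_0 \approx \frac{1}{\sqrt{x}}
\]

At x = 0, the estimate y_0 is inf and x * y_1 evaluates to 0 * inf = NaN, so an expansion of a bare sqrt(x) has to guard that case with a compare/select against 0.0, while the x/sqrt(x) form may legitimately produce NaN there and needs no guard.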
It might help to see the motivating code that produces the x/sqrt(x) pattern to see if there's something else we should be doing there.
Current AMD "x86_64" targets don't have a reciprocal sqrt instruction for the double-precision types, so x/sqrt(x) ends up as "vsqrtsd" followed by "vdivsd". This transform is basically meant to improve efficiency there.
Ah, I see. I think we should handle that in generic DAGCombiner then. There, we can make the target- and CPU-specific trade-offs necessary to get the (presumably) ideal asm code. I don't know how we would recover the missing div-by-0 info that I mentioned here.
Let me know if you want to try that patch. If not, I can take a shot at it.
Added some test coverage here:
rGdd1a900575ff
Feel free to adjust as needed. If I'm seeing correctly, it should be a similarly small code patch to this one, just adapted to SDAG nodes/flags.
For x = 0, x/sqrt(x) results in "nan". However, when we specify -ffast-math we set the "nnan" flag. The nnan flag says "Allow optimizations to assume the arguments and result are not NaN", so we can transform x/sqrt(x) to sqrt(x) under -ffast-math. Is that the right understanding here?
The transform is allowed here; it's just not advisable, as shown by the codegen for a target/type that expands it to an estimate sequence (i.e., we should abandon this patch).
The problem with doing this transform early (in IR) is that we cannot recover the knowledge that 0.0 was not a valid input. So we need to give targets the opportunity to create an estimate first, and only if that does not happen, convert to a single sqrt.