This is an archive of the discontinued LLVM Phabricator instance.

[InstSimplify] Implement Instruction simplification for X/sqrt(X) to sqrt(X).
Needs Review · Public

Authored by venkataramanan.kumar.llvm on Aug 11 2020, 12:09 AM.

Details

Reviewers
spatel
raghesh
Summary

This patch simplifies "X/sqrt(X)" to "sqrt(X)". For "X/sqrt(X)", LLVM currently generates two operations, a "sqrt" and a "div"; the simplification leaves a single "sqrt". The simplification is enabled only when reassociation (the "reassoc" fast-math flag) is set.
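For reference, here is a minimal sketch of what such a fold could look like in InstructionSimplify; the helper name and exact guards are assumptions for illustration, not the actual diff:

#include "llvm/IR/Operator.h"
#include "llvm/IR/PatternMatch.h"
using namespace llvm;
using namespace llvm::PatternMatch;

// Hypothetical helper: fold X / sqrt(X) --> sqrt(X) when the fdiv allows
// reassociation. Op0/Op1 are the fdiv operands.
static Value *foldDivBySqrt(Value *Op0, Value *Op1, FastMathFlags FMF) {
  if (FMF.allowReassoc() &&
      match(Op1, m_Intrinsic<Intrinsic::sqrt>(m_Specific(Op0))))
    return Op1; // reuse the existing sqrt call
  return nullptr;
}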

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald Transcript · Aug 11 2020, 12:09 AM
venkataramanan.kumar.llvm requested review of this revision. Aug 11 2020, 12:09 AM

Fixed test case to check for proper return value.

Just a drive-by comment: possibly the more general form of this needs to be optimised too?

The more general form of this is

X^a * X^b -> X ^ (a+b)

X^a / X^b -> X ^ (a-b)
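For illustration (a hypothetical example, not code from this patch), such patterns might come from pow calls in the source:

#include <cmath>

// Purely illustrative: source that produces the general forms above.
double mul_form(double x, double a, double b) {
  return std::pow(x, a) * std::pow(x, b);   // candidate for pow(x, a + b)
}
double div_form(double x, double a, double b) {
  return std::pow(x, a) / std::pow(x, b);   // candidate for pow(x, a - b)
}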

I'm fairly sure this transform is a performance loss. For a target like Skylake Server, a SQRT(x) can take up to 20 cycles. But a RSQRT(x) is about 6 cycles and a MUL(y) is 4 cycles. We'd be better off with X*RSQRT(X).

Do you expect the more general form to use the pow intrinsic? It would be good to file a bugzilla with examples of how this looks in source/IR, so we know that we are matching the expected patterns. I think that will have to be handled independently from the sqrt special-case.

I'm fairly sure this transform is a performance loss. For a target like Skylake Server, a SQRT(x) can take up to 20 cycles. But a RSQRT(x) is about 6 cycles and a MUL(y) is 4 cycles. We'd be better off with X*RSQRT(X).

That is up to backends to decide. InstSimplify/InstCombine (and a few others) are canonicalization, target-independent passes.
A single sqrt(x) is more canonical IR than x/sqrt(x), because it's fewer instructions and x has fewer uses.

I agree with that. It should be canonicalized. It would also be good to make sure that the backends have lowering code in place before introducing a 2x performance hit.

As of now, for -march=skylake -Ofast I get:

https://godbolt.org/z/jqWzPq

--Snip--
foo:                                    # @foo
        vsqrtsd xmm1, xmm0, xmm0
        vdivsd  xmm0, xmm0, xmm1
        ret
--Snip--

The backend can lower the SQRT(X) back to X * RSQRT(X) in a separate patch?

Yeah, separate patch is okay. A SQRT+DIV is definitely bad.

I agree with that. It should be canonicalized. It would also be good to make sure that the backends have lowering code in place before introducing a 2x performance hit.

I agree with both.
Knowing that this is often a perf-critical pattern, we've put a lot of SDAG effort into optimization of it for x86 already, but it's a tricky problem because we get into a similar situation to the one we have with FMA. Ie, should we favor latency, throughput, or some combo?
Here's an example to show that we are trying to make the right decision per-CPU in codegen:
https://godbolt.org/z/s5GPqc
Note that the codegen choice can differ between scalar/vector - example in the X86.td file:

// FeatureFastScalarFSQRT should be enabled if scalar FSQRT has shorter latency
// than the corresponding NR code. FeatureFastVectorFSQRT should be enabled if
// vector FSQRT has higher throughput than the corresponding NR code.
// The idea is that throughput bound code is likely to be vectorized, so for
// vectorized code we should care about the throughput of SQRT operations.
// But if the code is scalar that probably means that the code has some kind of
// dependency and we should care more about reducing the latency.

Agreed "FeatureFastScalarFSQRT" can be removed if target thinks scalar FSQRT is costly. I see currently set at "SKXTuning" (Skylake).

After looking at the codegen, I'm not sure if we can do this transform in IR with the expected performance in codegen because the transform loses information:
https://godbolt.org/z/7b84rG

The codegen for the case of "sqrt(x)" has to account for a 0.0 input. Ie, we filter out a 0.0 (or potentially denorm) input to avoid the NAN answer that we would get from "0.0 / 0.0". But the codegen for the case of "x/sqrt(x)" does not have to do that - NAN is the correct answer for a 0.0 input, so the code has implicitly signaled to us that 0.0 is not a valid input when compiled with -ffast-math (we can ignore possible NANs).

It might help to see the motivating code that produces the x/sqrt(x) pattern to see if there's something else we should be doing there.
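To make the information-loss point concrete, here is a rough scalar model of the two expansions (an editorial sketch assuming a one-step Newton-Raphson refinement; the real codegen uses the hardware estimate instruction):

#include <cmath>

// Stand-in for the hardware reciprocal-sqrt estimate (e.g. vrsqrtss); the
// exact estimate does not matter for the point being made here.
static float rsqrt_estimate(float x) { return 1.0f / std::sqrt(x); }

// sqrt(x) expanded via the estimate: one Newton-Raphson step plus an extra
// filter so that x == 0.0 yields 0.0 rather than 0.0 * inf == NaN.
float sqrt_expanded(float x) {
  float e = rsqrt_estimate(x);
  e = e * (1.5f - 0.5f * x * e * e); // refine 1/sqrt(x)
  float r = x * e;                   // sqrt(x) ~= x * (1/sqrt(x))
  return (x == 0.0f) ? 0.0f : r;     // the filtering described above
}

// x / sqrt(x) expanded the same way needs no filter: NaN is the correct
// answer for x == 0, so under fast-math the input is implicitly nonzero.
float div_by_sqrt_expanded(float x) {
  float e = rsqrt_estimate(x);
  e = e * (1.5f - 0.5f * x * e * e);
  return x * e;
}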

Excellent catch, Sanjay. The sqrt(x) -> x*rsqrt(x) transform is not safe for x==0.

Current AMD "x86_64" targets don't have the reciprocal sqrt instruction for the double precision types.
so x/sqrt(x) ends up with "vsqrtsd" followed by "vdivsd". This transform is basically to improve the efficiency.

Ah, I see. I think we should handle that in generic DAGCombiner then. There, we can make the target- and CPU-specific trade-offs necessary to get the (presumably) ideal asm code. I don't know how we would recover the missing div-by-0 info that I mentioned here.
Let me know if you want to try that patch. If not, I can take a shot at it.

Sure, I will work on the DAGCombiner patch.

Added some test coverage here:
rGdd1a900575ff

Feel free to adjust as needed. If I'm seeing it correctly, it should be a small code patch similar to this one, just adapted to SDAG nodes/flags.

For x = 0, x/sqrt(x) results in "nan". However, when we specify -ffast-math we are setting the "nnan" flag. The nnan flag says "Allow optimizations to assume the arguments and result are not NaN", so we can transform x/sqrt(x) to sqrt(x) under -ffast-math. Is that the right understanding here?

The transform is allowed here; it's just not advisable as shown by the codegen for a target/type that expands it to an estimate sequence (ie, we should abandon this patch).
The problem with doing this transform early (in IR) is that we cannot recover the knowledge that 0.0 was not a valid input. So we need to give targets the opportunity to create an estimate first, and only if that does not happen, convert to a single sqrt.
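A rough sketch of that ordering in SDAG terms (the helper name, guards, and placement are assumptions, not the committed patch): run the fold only after the target's reciprocal-sqrt estimate combines have had their chance, so it fires only when no estimate sequence was built.

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Hypothetical late fold for an fdiv node N: X / fsqrt(X) --> fsqrt(X),
// guarded by the same reassociation condition the IR patch uses.
static SDValue foldFDivBySqrt(SDNode *N) {
  SDValue N0 = N->getOperand(0);
  SDValue N1 = N->getOperand(1);
  SDNodeFlags Flags = N->getFlags();
  if (Flags.hasAllowReassociation() &&
      N1.getOpcode() == ISD::FSQRT && N1.getOperand(0) == N0)
    return N1; // reuse the existing fsqrt node
  return SDValue();
}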