This is an archive of the discontinued LLVM Phabricator instance.

Differential D85709

[InstSimplify] Implement Instruction simplification for X/sqrt(X) to sqrt(X).
Needs Review · Public

Authored by venkataramanan.kumar.llvm on Aug 11 2020, 12:09 AM.

Details

Reviewers

spatel
raghesh

Summary

This patch simplifies "X/sqrt(X)" to "sqrt(X)". For "X/sqrt(X)", LLVM generates 2 operations "sqrt" and "div". Simplifying it results in one "sqrt" operation. The simplification is enabled when re-association is set.

Diff Detail

Unit Tests: Failed

	Time	Test
	20 ms	linux > Flang.Preprocessing::compiler_defined_macros.F90
	30 ms	linux > LLVM.Transforms/InstSimplify::fdiv.ll
	60 ms	windows > LLVM.Transforms/InstSimplify::fdiv.ll

Event Timeline

venkataramanan.kumar.llvm created this revision. Aug 11 2020, 12:09 AM

Herald added a project: Restricted Project. Aug 11 2020, 12:09 AM

Herald added subscribers: llvm-commits, hiraditya.

venkataramanan.kumar.llvm requested review of this revision. Aug 11 2020, 12:09 AM

Fixed test case to check for proper return value.

Harbormaster completed remote builds in B67847: Diff 284586. Aug 11 2020, 12:48 AM

xbolva00 added a subscriber: xbolva00. Aug 11 2020, 1:00 AM

Just a drive-by comment : Possibly the more general form of this needs to be optimised too?

The more general form of this is

X^a * X^b -> X ^ (a+b)

X^a / X^b -> X ^ (a-b)
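
As a rough illustration of the identity above, the following sketch compares the two forms numerically. The function names are hypothetical and do not come from the patch; with a = 1 and b = 0.5 the quotient form corresponds to X/sqrt(X) and the difference form to sqrt(X). The two forms can differ in the last bits, which is presumably why such a fold would need fast-math flags.

```cpp
#include <cmath>

// Hypothetical helpers for illustration only; not part of D85709.

// X^a / X^b computed literally: two pow calls and a division.
double powQuotient(double X, double A, double B) {
  return std::pow(X, A) / std::pow(X, B);
}

// X^(a-b): the folded form, with a single pow call.
double powDifference(double X, double A, double B) {
  return std::pow(X, A - B);
}
```

For example, powQuotient(9.0, 1.0, 0.5) and powDifference(9.0, 1.0, 0.5) should both be approximately 3.0.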

I'm fairly sure this transform is a performance loss. For a target like Skylake Server, a SQRT(x) can take up to 20 cycles. But a RSQRT(x) is about 6 cycles and a MUL(y) is 4 cycles. We'd be better off with a X*RSQRT(X).

In D85709#2209206, @grandinj wrote:
Just a drive-by comment : Possibly the more general form of this needs to be optimised too?

The more general form of this is
X^a * X^b -> X ^ (a+b)

X^a / X^b -> X ^ (a-b)

Do you expect the more general form to use the pow intrinsic? It would be good to file a bugzilla with examples of how this looks in source/IR, so we know that we are matching the expected patterns. I think that will have to be handled independently from the sqrt special-case.

In D85709#2209979, @cameron.mcinally wrote:

I'm fairly sure this transform is a performance loss. For a target like Skylake Server, a SQRT(x) can take up to 20 cycles. But a RSQRT(x) is about 6 cycles and a MUL(y) is 4 cycles. We'd be better off with a X*RSQRT(X).

That is up to backends to decide. InstSimplify/InstCombine (and a few others) are canonicalization, target-independent passes.
A single sqrt(x) is more canonical IR than x/sqrt(x), because it's less instructions and x has less uses.

In D85709#2210034, @lebedev.ri wrote:

In D85709#2209979, @cameron.mcinally wrote:

I'm fairly sure this transform is a performance loss. For a target like Skylake Server, a SQRT(x) can take up to 20 cycles. But a RSQRT(x) is about 6 cycles and a MUL(y) is 4 cycles. We'd be better off with a X*RSQRT(X).

That is up to backends to decide. InstSimplify/InstCombine (and a few others) are canonicalization, target-independent passes.
A single sqrt(x) is more canonical IR than x/sqrt(x), because it's less instructions and x has less uses.

I agree with that. It should be canonicalized. It would also be good to make sure that the backends have lowering code in place before introducing a 2x performance hit.

In D85709#2210042, @cameron.mcinally wrote:

In D85709#2210034, @lebedev.ri wrote:

In D85709#2209979, @cameron.mcinally wrote:

I'm fairly sure this transform is a performance loss. For a target like Skylake Server, a SQRT(x) can take up to 20 cycles. But a RSQRT(x) is about 6 cycles and a MUL(y) is 4 cycles. We'd be better off with a X*RSQRT(X).

That is up to backends to decide. InstSimplify/InstCombine (and a few others) are canonicalization, target-independent passes.
A single sqrt(x) is more canonical IR than x/sqrt(x), because it's less instructions and x has less uses.

I agree with that. It should be canonicalized. It would also be good to make sure that the backends have lowering code in place before introducing a 2x performance hit.

As of now for -march=skylake -Ofast I get.

https://godbolt.org/z/jqWzPq

--Snip--
foo: # @foo
  vsqrtsd xmm1, xmm0, xmm0
  vdivsd xmm0, xmm0, xmm1
  ret
--Snip--

Backend can lower the SQRT(X) back to X * RSQRT(X) in a separate patch?

Yeah, separate patch is okay. A SQRT+DIV is definitely bad.

In D85709#2210042, @cameron.mcinally wrote:

In D85709#2210034, @lebedev.ri wrote:

In D85709#2209979, @cameron.mcinally wrote:

I'm fairly sure this transform is a performance loss. For a target like Skylake Server, a SQRT(x) can take up to 20 cycles. But a RSQRT(x) is about 6 cycles and a MUL(y) is 4 cycles. We'd be better off with a X*RSQRT(X).

That is up to backends to decide. InstSimplify/InstCombine (and a few others) are canonicalization, target-independent passes.
A single sqrt(x) is more canonical IR than x/sqrt(x), because it's less instructions and x has less uses.

I agree with that. It should be canonicalized. It would also be good to make sure that the backends have lowering code in place before introducing a 2x performance hit.

I agree with both.
Knowing that this is often a perf-critical pattern, we've put a lot of SDAG effort into optimization of it for x86 already, but it's a tricky problem because we get into a similar situation that we have with FMA. Ie, should we favor latency, throughput, or some combo?
Here's an example to show that we are trying to make the right decision per-CPU in codegen:
https://godbolt.org/z/s5GPqc
Note that the codegen choice can differ between scalar/vector - example in the X86.td file:

// FeatureFastScalarFSQRT should be enabled if scalar FSQRT has shorter latency
// than the corresponding NR code. FeatureFastVectorFSQRT should be enabled if
// vector FSQRT has higher throughput than the corresponding NR code.
// The idea is that throughput bound code is likely to be vectorized, so for
// vectorized code we should care about the throughput of SQRT operations.
// But if the code is scalar that probably means that the code has some kind of
// dependency and we should care more about reducing the latency.
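
For readers unfamiliar with the "NR code" mentioned in that comment, here is a minimal sketch, assuming NR refers to Newton-Raphson refinement of a reciprocal square root estimate. The function names and the caller-supplied starting estimate are hypothetical; a real backend would start from a hardware estimate instruction and pick the step count per type and CPU.

```cpp
#include <cmath>

// Hypothetical helpers for illustration only; not LLVM code.

// One Newton-Raphson step for y ~ 1/sqrt(x):
//   y' = y * (1.5 - 0.5 * x * y * y)
double refineRsqrt(double X, double Y) {
  return Y * (1.5 - 0.5 * X * Y * Y);
}

// sqrt(x) computed as x * rsqrt(x), refining a caller-supplied
// estimate of 1/sqrt(x) for the requested number of steps.
double sqrtViaRsqrt(double X, double Estimate, int Steps) {
  double Y = Estimate;
  for (int I = 0; I < Steps; ++I)
    Y = refineRsqrt(X, Y);
  return X * Y;
}
```

Each step costs a few multiplies and a subtract, so whether the sequence beats a hardware sqrt depends on the latency and throughput trade-off described in the comment.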

In D85709#2210125, @spatel wrote:
In D85709#2210042, @cameron.mcinally wrote:

In D85709#2210034, @lebedev.ri wrote:

In D85709#2209979, @cameron.mcinally wrote:

I'm fairly sure this transform is a performance loss. For a target like Skylake Server, a SQRT(x) can take up to 20 cycles. But a RSQRT(x) is about 6 cycles and a MUL(y) is 4 cycles. We'd be better off with a X*RSQRT(X).

That is up to backends to decide. InstSimplify/InstCombine (and a few others) are canonicalization, target-independent passes.
A single sqrt(x) is more canonical IR than x/sqrt(x), because it's less instructions and x has less uses.

I agree with that. It should be canonicalized. It would also be good to make sure that the backends have lowering code in place before introducing a 2x performance hit.

I agree with both.
Knowing that this is often a perf-critical pattern, we've put a lot of SDAG effort into optimization of it for x86 already, but it's a tricky problem because we get into a similar situation that we have with FMA. Ie, should we favor latency, throughput, or some combo?
Here's an example to show that we are trying to make the right decision per-CPU in codegen:
https://godbolt.org/z/s5GPqc
Note that the codegen choice can differ between scalar/vector - example in the X86.td file:
// FeatureFastScalarFSQRT should be enabled if scalar FSQRT has shorter latency
// than the corresponding NR code. FeatureFastVectorFSQRT should be enabled if
// vector FSQRT has higher throughput than the corresponding NR code.
// The idea is that throughput bound code is likely to be vectorized, so for
// vectorized code we should care about the throughput of SQRT operations.
// But if the code is scalar that probably means that the code has some kind of
// dependency and we should care more about reducing the latency.

Agreed. "FeatureFastScalarFSQRT" can be removed if the target considers scalar FSQRT costly. I see it is currently set in "SKXTuning" (Skylake).

After looking at the codegen, I'm not sure if we can do this transform in IR with the expected performance in codegen because the transform loses information:
https://godbolt.org/z/7b84rG

The codegen for the case of "sqrt(x)" has to account for a 0.0 input. Ie, we filter out a 0.0 (or potentially denorm) input to avoid the NAN answer that we would get from "0.0 / 0.0". But the codegen for the case of "x/sqrt(x)" does not have to do that - NAN is the correct answer for a 0.0 input, so the code has implicitly signaled to us that 0.0 is not a valid input when compiled with -ffast-math (we can ignore possible NANs).

It might help to see the motivating code that produces the x/sqrt(x) pattern to see if there's something else we should be doing there.
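
The zero-input behavior described above can be checked with a small sketch under strict IEEE semantics (compiled without -ffast-math). The function names are hypothetical and are not taken from the patch or the godbolt links.

```cpp
#include <cmath>

// Hypothetical helpers for illustration only; not part of D85709.

// The source pattern: x / sqrt(x). For x == 0.0 this is 0.0 / 0.0, i.e. NaN.
double divBySqrt(double X) { return X / std::sqrt(X); }

// The proposed replacement: sqrt(x). For x == 0.0 this is 0.0.
double plainSqrt(double X) { return std::sqrt(X); }

// A reciprocal-based expansion: x * (1 / sqrt(x)).
// For x == 0.0 this is 0.0 * infinity, i.e. NaN.
double mulByRsqrt(double X) { return X * (1.0 / std::sqrt(X)); }
```

For nonzero inputs such as 4.0 all three return 2.0. At 0.0, only plainSqrt returns a number, so code that expands sqrt(x) into a reciprocal estimate sequence has to guard the zero case, while an expansion of x/sqrt(x) does not.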

In D85709#2211204, @spatel wrote:

After looking at the codegen, I'm not sure if we can do this transform in IR with the expected performance in codegen because the transform loses information:
https://godbolt.org/z/7b84rG

The codegen for the case of "sqrt(x)" has to account for a 0.0 input. Ie, we filter out a 0.0 (or potentially denorm) input to avoid the NAN answer that we would get from "0.0 / 0.0". But the codegen for the case of "x/sqrt(x)" does not have to do that - NAN is the correct answer for a 0.0 input, so the code has implicitly signaled to us that 0.0 is not a valid input when compiled with -ffast-math (we can ignore possible NANs).

It might help to see the motivating code that produces the x/sqrt(x) pattern to see if there's something else we should be doing there.

Excellent catch, Sanjay. The sqrt(x) -> x*rsqrt(x) transform is not safe for x==0.

In D85709#2211204, @spatel wrote:

After looking at the codegen, I'm not sure if we can do this transform in IR with the expected performance in codegen because the transform loses information:
https://godbolt.org/z/7b84rG

The codegen for the case of "sqrt(x)" has to account for a 0.0 input. Ie, we filter out a 0.0 (or potentially denorm) input to avoid the NAN answer that we would get from "0.0 / 0.0". But the codegen for the case of "x/sqrt(x)" does not have to do that - NAN is the correct answer for a 0.0 input, so the code has implicitly signaled to us that 0.0 is not a valid input when compiled with -ffast-math (we can ignore possible NANs).

It might help to see the motivating code that produces the x/sqrt(x) pattern to see if there's something else we should be doing there.

Current AMD "x86_64" targets don't have the reciprocal sqrt instruction for the double precision types.
so x/sqrt(x) ends up with "vsqrtsd" followed by "vdivsd". This transform is basically to improve the efficiency.

In D85709#2214967, @venkataramanan.kumar.llvm wrote:

In D85709#2211204, @spatel wrote:

After looking at the codegen, I'm not sure if we can do this transform in IR with the expected performance in codegen because the transform loses information:
https://godbolt.org/z/7b84rG

The codegen for the case of "sqrt(x)" has to account for a 0.0 input. Ie, we filter out a 0.0 (or potentially denorm) input to avoid the NAN answer that we would get from "0.0 / 0.0". But the codegen for the case of "x/sqrt(x)" does not have to do that - NAN is the correct answer for a 0.0 input, so the code has implicitly signaled to us that 0.0 is not a valid input when compiled with -ffast-math (we can ignore possible NANs).

It might help to see the motivating code that produces the x/sqrt(x) pattern to see if there's something else we should be doing there.

Current AMD "x86_64" targets don't have the reciprocal sqrt instruction for the double precision types.
so x/sqrt(x) ends up with "vsqrtsd" followed by "vdivsd". This transform is basically to improve the efficiency.

Ah, I see. I think we should handle that in generic DAGCombiner then. There, we can make the target- and CPU-specific trade-offs necessary to get the (presumably) ideal asm code. I don't know how we would recover the missing div-by-0 info that I mentioned here.
Let me know if you want to try that patch. If not, I can take a shot at it.

In D85709#2215718, @spatel wrote:

In D85709#2214967, @venkataramanan.kumar.llvm wrote:

In D85709#2211204, @spatel wrote:

After looking at the codegen, I'm not sure if we can do this transform in IR with the expected performance in codegen because the transform loses information:
https://godbolt.org/z/7b84rG

The codegen for the case of "sqrt(x)" has to account for a 0.0 input. Ie, we filter out a 0.0 (or potentially denorm) input to avoid the NAN answer that we would get from "0.0 / 0.0". But the codegen for the case of "x/sqrt(x)" does not have to do that - NAN is the correct answer for a 0.0 input, so the code has implicitly signaled to us that 0.0 is not a valid input when compiled with -ffast-math (we can ignore possible NANs).

It might help to see the motivating code that produces the x/sqrt(x) pattern to see if there's something else we should be doing there.

Current AMD "x86_64" targets don't have the reciprocal sqrt instruction for the double precision types.
so x/sqrt(x) ends up with "vsqrtsd" followed by "vdivsd". This transform is basically to improve the efficiency.

Ah, I see. I think we should handle that in generic DAGCombiner then. There, we can make the target- and CPU-specific trade-offs necessary to get the (presumably) ideal asm code. I don't know how we would recover the missing div-by-0 info that I mentioned here.
Let me know if you want to try that patch. If not, I can take a shot at it.

Sure, I will work on the DAGCombiner patch.

In D85709#2215786, @venkataramanan.kumar.llvm wrote:

In D85709#2215718, @spatel wrote:

In D85709#2214967, @venkataramanan.kumar.llvm wrote:

In D85709#2211204, @spatel wrote:

After looking at the codegen, I'm not sure if we can do this transform in IR with the expected performance in codegen because the transform loses information:
https://godbolt.org/z/7b84rG

The codegen for the case of "sqrt(x)" has to account for a 0.0 input. Ie, we filter out a 0.0 (or potentially denorm) input to avoid the NAN answer that we would get from "0.0 / 0.0". But the codegen for the case of "x/sqrt(x)" does not have to do that - NAN is the correct answer for a 0.0 input, so the code has implicitly signaled to us that 0.0 is not a valid input when compiled with -ffast-math (we can ignore possible NANs).

It might help to see the motivating code that produces the x/sqrt(x) pattern to see if there's something else we should be doing there.

Current AMD "x86_64" targets don't have the reciprocal sqrt instruction for the double precision types.
so x/sqrt(x) ends up with "vsqrtsd" followed by "vdivsd". This transform is basically to improve the efficiency.

Ah, I see. I think we should handle that in generic DAGCombiner then. There, we can make the target- and CPU-specific trade-offs necessary to get the (presumably) ideal asm code. I don't know how we would recover the missing div-by-0 info that I mentioned here.
Let me know if you want to try that patch. If not, I can take a shot at it.

Sure, I will work on the DAGCombiner patch.

Added some test coverage here:
rGdd1a900575ff

Feel free to adjust as needed. If I'm seeing correctly, it should be a similar small code patch as this one, just adapted to SDAG nodes/flags.

In D85709#2215718, @spatel wrote:

In D85709#2214967, @venkataramanan.kumar.llvm wrote:

In D85709#2211204, @spatel wrote:

After looking at the codegen, I'm not sure if we can do this transform in IR with the expected performance in codegen because the transform loses information:
https://godbolt.org/z/7b84rG

The codegen for the case of "sqrt(x)" has to account for a 0.0 input. Ie, we filter out a 0.0 (or potentially denorm) input to avoid the NAN answer that we would get from "0.0 / 0.0". But the codegen for the case of "x/sqrt(x)" does not have to do that - NAN is the correct answer for a 0.0 input, so the code has implicitly signaled to us that 0.0 is not a valid input when compiled with -ffast-math (we can ignore possible NANs).

It might help to see the motivating code that produces the x/sqrt(x) pattern to see if there's something else we should be doing there.

Current AMD "x86_64" targets don't have the reciprocal sqrt instruction for the double precision types.
so x/sqrt(x) ends up with "vsqrtsd" followed by "vdivsd". This transform is basically to improve the efficiency.

Ah, I see. I think we should handle that in generic DAGCombiner then. There, we can make the target- and CPU-specific trade-offs necessary to get the (presumably) ideal asm code. I don't know how we would recover the missing div-by-0 info that I mentioned here.
Let me know if you want to try that patch. If not, I can take a shot at it.

For x = 0, x/sqrt(x) results in "nan". However, when we specify -ffast-math we set the "nnan" flag. The nnan flag says "Allow optimizations to assume the arguments and result are not NaN", so we can transform x/sqrt(x) to sqrt(x) under -ffast-math. Is that the right understanding here?

In D85709#2216241, @venkataramanan.kumar.llvm wrote:

For x = 0, x/sqrt(x) results in "nan". However, when we specify -ffast-math we set the "nnan" flag. The nnan flag says "Allow optimizations to assume the arguments and result are not NaN", so we can transform x/sqrt(x) to sqrt(x) under -ffast-math. Is that the right understanding here?

The transform is allowed here; it's just not advisable as shown by the codegen for a target/type that expands it to an estimate sequence (ie, we should abandon this patch).
The problem with doing this transform early (in IR) is that we cannot recover the knowledge that 0.0 was not a valid input. So we need to give targets the opportunity to create an estimate first, and only if that does not happen, convert to a single sqrt.

spatel mentioned this in rG62e91bf56333: [DAGCombine]: Fold X/Sqrt(X) to Sqrt(X). Aug 24 2020, 3:16 PM

spatel mentioned this in D86726: [InstCombine]: Transform 1.0/sqrt(X) * X to X/sqrt(X) and X * 1.0/sqrt(X) to X/sqrt(X). Aug 28 2020, 5:17 AM

Revision Contents

	Path	Size
	llvm/lib/Analysis/InstructionSimplify.cpp	5 lines
	llvm/test/Transforms/InstSimplify/fdiv.ll	10 lines

Diff 284586

llvm/lib/Analysis/InstructionSimplify.cpp

(4,884 lines not shown)
    if (FMF.noNaNs()) {
      // -X / X -> -1.0 and
      // X / -X -> -1.0 are legal when NaNs are ignored.
      // We can ignore signed zeros because +-0.0/+-0.0 is NaN and ignored.
      if (match(Op0, m_FNegNSZ(m_Specific(Op1))) ||
          match(Op1, m_FNegNSZ(m_Specific(Op0))))
        return ConstantFP::get(Op0->getType(), -1.0);
    }

+   // x/sqrt(x) = sqrt(x)
+   if (match(Op1, m_Intrinsic<Intrinsic::sqrt>(m_Specific(Op0))) &&
+       FMF.allowReassoc() && FMF.noNaNs() && FMF.noSignedZeros())
+     return Op1;
+
    return nullptr;
  }

  Value *llvm::SimplifyFDivInst(Value *Op0, Value *Op1, FastMathFlags FMF,
                                const SimplifyQuery &Q) {
    return ::SimplifyFDivInst(Op0, Op1, FMF, Q, RecursionLimit);
  }

(962 lines not shown)

llvm/test/Transforms/InstSimplify/fdiv.ll

(20 lines not shown)
  ; CHECK-LABEL: @fmul_fdiv_common_operand(
  ; CHECK-NEXT: ret double %x
  ;
    %m = fmul double %x, %y
    %d = fdiv reassoc nnan double %m, %y
    ret double %d
  }

+ define double @sqrt_fdiv_common_operand(double %x) {
+ ; CHECK-LABEL: @sqrt_fdiv_common_operand(
+ ; CHECK-NEXT: %0 = tail call fast double @llvm.sqrt.f64(double %x)
+ ; CHECK-NEXT: ret double %x
+ ;
+   %0 = tail call fast double @llvm.sqrt.f64(double %x)
+   %1 = fdiv fast double %x, %0
+   ret double %1
+ }
+
  ; Negative test - the fdiv must be reassociative and not allow NaNs.

  define double @fmul_fdiv_common_operand_too_strict(double %x, double %y) {
  ; CHECK-LABEL: @fmul_fdiv_common_operand_too_strict(
  ; CHECK-NEXT: [[M:%.*]] = fmul fast double %x, %y
  ; CHECK-NEXT: [[D:%.*]] = fdiv reassoc double [[M]], %y
  ; CHECK-NEXT: ret double [[D]]
  ;
(16 lines not shown)