This patch simplifies "X/sqrt(X)" to "sqrt(X)". For "X/sqrt(X)", LLVM generates two operations, "sqrt" and "div". Simplifying it leaves a single "sqrt" operation. The simplification is enabled only when re-association is allowed.
|20 ms||linux > Flang.Preprocessing::compiler_defined_macros.F90|
Script: -- : 'RUN: at line 6'; /mnt/disks/ssd0/agent/llvm-project/build/bin/f18 -intrinsic-module-directory /mnt/disks/ssd0/agent/llvm-project/build/tools/flang/include/flang -E /mnt/disks/ssd0/agent/llvm-project/flang/test/Preprocessing/compiler_defined_macros.F90 | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck --ignore-case /mnt/disks/ssd0/agent/llvm-project/flang/test/Preprocessing/compiler_defined_macros.F90
|30 ms||linux > LLVM.Transforms/InstSimplify::fdiv.ll|
Script: -- : 'RUN: at line 2'; /mnt/disks/ssd0/agent/llvm-project/build/bin/opt < /mnt/disks/ssd0/agent/llvm-project/llvm/test/Transforms/InstSimplify/fdiv.ll -instsimplify -S | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/llvm/test/Transforms/InstSimplify/fdiv.ll
|60 ms||windows > LLVM.Transforms/InstSimplify::fdiv.ll|
Script: -- : 'RUN: at line 2'; c:\ws\w16n2-1\llvm-project\premerge-checks\build\bin\opt.exe < C:\ws\w16n2-1\llvm-project\premerge-checks\llvm\test\Transforms\InstSimplify\fdiv.ll -instsimplify -S | c:\ws\w16n2-1\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w16n2-1\llvm-project\premerge-checks\llvm\test\Transforms\InstSimplify\fdiv.ll
I'm fairly sure this transform is a performance loss. For a target like Skylake Server, a SQRT(x) can take up to 20 cycles. But a RSQRT(x) is about 6 cycles and a MUL(y) is 4 cycles. We'd be better off with a X*RSQRT(X).
Do you expect the more general form to use the pow intrinsic? It would be good to file a bugzilla with examples of how this looks in source/IR, so we know that we are matching the expected patterns. I think that will have to be handled independently from the sqrt special-case.
That is up to backends to decide. InstSimplify/InstCombine (and a few others) are canonicalization, target-independent passes.
A single sqrt(x) is more canonical IR than x/sqrt(x), because it has fewer instructions and x has fewer uses.
I agree with both.
Knowing that this is often a perf-critical pattern, we've already put a lot of SDAG effort into optimizing it for x86, but it's a tricky problem because we get into a situation similar to the one we have with FMA. I.e., should we favor latency, throughput, or some combo?
Here's an example to show that we are trying to make the right decision per-CPU in codegen:
Note that the codegen choice can differ between scalar/vector - example in the X86.td file:
// FeatureFastScalarFSQRT should be enabled if scalar FSQRT has shorter latency
// than the corresponding NR code. FeatureFastVectorFSQRT should be enabled if
// vector FSQRT has higher throughput than the corresponding NR code.
// The idea is that throughput bound code is likely to be vectorized, so for
// vectorized code we should care about the throughput of SQRT operations.
// But if the code is scalar that probably means that the code has some kind of
// dependency and we should care more about reducing the latency.
After looking at the codegen, I'm not sure if we can do this transform in IR with the expected performance in codegen because the transform loses information:
The codegen for the case of "sqrt(x)" has to account for a 0.0 input. I.e., we filter out a 0.0 (or potentially denorm) input to avoid the NAN answer that we would get from "0.0 / 0.0". But the codegen for the case of "x/sqrt(x)" does not have to do that: NAN is the correct answer for a 0.0 input, so the code has implicitly signaled to us that 0.0 is not a valid input when compiled with -ffast-math (we can ignore possible NANs).
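To make the 0.0 issue concrete, here is a rough scalar sketch of what an estimate-based expansion of sqrt(x) looks like. It is not the actual SDAG output: `approx_rsqrt`, `refine`, and the two `sqrt_estimate_*` functions are hypothetical names, and the bit-level initial guess is only a stand-in for a hardware estimate instruction such as RSQRTSS (which, as far as I know, returns +inf for a +0.0 input). It assumes a build without -ffast-math so that inf/NaN arithmetic is preserved.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical stand-in for a hardware reciprocal-sqrt estimate.
// Assumption: like the hardware instruction, it yields +inf for a zero input.
float approx_rsqrt(float x) {
  if (x == 0.0f)
    return std::numeric_limits<float>::infinity();
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits = 0x5f3759dfu - (bits >> 1);  // well-known bit-level initial guess
  float y;
  std::memcpy(&y, &bits, sizeof(y));
  return y;
}

// One Newton-Raphson (NR) refinement step for y ~= 1/sqrt(x).
float refine(float x, float y) { return y * (1.5f - 0.5f * x * y * y); }

// sqrt(x) expanded as x * rsqrt(x) with no special-casing.
// For x == 0.0 the estimate is +inf, and 0.0 * inf produces NaN.
float sqrt_estimate_unguarded(float x) {
  float y = approx_rsqrt(x);
  y = refine(x, y);
  y = refine(x, y);
  return x * y;
}

// The same expansion with the 0.0 input filtered out, which is the extra
// work a lone sqrt(x) requires and x/sqrt(x) does not.
float sqrt_estimate_guarded(float x) {
  return x == 0.0f ? 0.0f : sqrt_estimate_unguarded(x);
}
```

The unguarded version returns NaN for a 0.0 input, which is acceptable for x/sqrt(x) under nnan but wrong for a plain sqrt(x); the guarded version pays for an extra compare and select.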
It might help to see the motivating code that produces the x/sqrt(x) pattern to see if there's something else we should be doing there.
Current AMD "x86_64" targets don't have a reciprocal sqrt instruction for double-precision types, so x/sqrt(x) ends up as "vsqrtsd" followed by "vdivsd". This transform is basically meant to improve efficiency there.
Ah, I see. I think we should handle that in generic DAGCombiner then. There, we can make the target- and CPU-specific trade-offs necessary to get the (presumably) ideal asm code. I don't know how we would recover the missing div-by-0 info that I mentioned here.
Let me know if you want to try that patch. If not, I can take a shot at it.
For x = 0, x/sqrt(x) results in "nan". However, when we specify -ffast-math we are setting the "nnan" flag. The nnan flag says "Allow optimizations to assume the arguments and result are not NaN", so we can transform x/sqrt(x) to sqrt(x) under -ffast-math. Is that the right understanding here?
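For reference, a minimal sketch of the two source-level forms, assuming strict IEEE semantics (built without -ffast-math); `div_form` and `sqrt_form` are illustrative names only. The forms agree for positive inputs and differ only at x = 0, which is exactly the case nnan lets the optimizer ignore.

```cpp
#include <cmath>

// The original pattern: one sqrt plus one divide.
// At x == 0.0 this evaluates 0.0 / 0.0, which is NaN under IEEE rules.
double div_form(double x) { return x / std::sqrt(x); }

// The simplified pattern: a single sqrt.
// At x == 0.0 this is simply 0.0.
double sqrt_form(double x) { return std::sqrt(x); }
```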
The transform is allowed here; it's just not advisable, as shown by the codegen for a target/type that expands it to an estimate sequence (i.e., we should abandon this patch).
The problem with doing this transform early (in IR) is that we cannot recover the knowledge that 0.0 was not a valid input. So we need to give targets the opportunity to create an estimate first, and only if that does not happen, convert to a single sqrt.