When calculating a square root using Newton-Raphson with two constants,

a naive implementation is to use five multiplications (four muls to calculate

reciprocal square root and another one to calculate the square root itself).

However, after some reassociation and CSE the same result can be obtained

with only four multiplications. Unfortunately, there's no reliable way to do

such a reassociation in the back-end. So, the patch modifies NR code itself

so that it directly builds optimal code for SQRT and doesn't rely on any

further reassociation.

# Details

# Diff Detail

- Repository
- rL LLVM

### Event Timeline

This seems like a good perf optimization to me in general, but a few high-level questions:

- How does the refactoring of the multiplication affect the accuracy of the results? I attached some test programs that could be modified to answer this in https://llvm.org/bugs/show_bug.cgi?id=21385 .

- The case of converting a sqrt into an estimate sequence is a problem in https://llvm.org/bugs/show_bug.cgi?id=24063 . Would this patch sidestep that bug by producing the more accurate/expected result for that case?

- Given that we have 3 generations of Intel FMA FPUs now (and the non-x86 world has always had FMA), it would be interesting to know the accuracy (and codegen tests should probably be added) for the FMA variant. It would also be good to add tests with 2 refinement steps, so we know this patch is behaving as expected in that case, but...

- (Apologies for straying from the immediate goal of the patch...) Given that Intel FPUs in Broadwell and Skylake have reduced IEEE-compliant div/sqrt to 4 and then 3 cycle throughputs according to Agner's tables, using an estimate is likely a perf misoptimization for current and future Intel big cores. I think that we can fix this using the existing hook 'isFsqrtCheap()', but please confirm that this patch preserves that opportunity.

Thanks for great questions, Sanjay!

I slightly modified the example from PR21385 and compared these two

sequences to calculate square roots (the current one and the patched one):

est1 = (-0.5f * est0) * (-3.0f + est0 * est0 * f) * f; float ae = est0 * f; est2 = (-0.5f * ae) * (-3.0f + est0 * ae);

And I obtained the following results:

est1 est2 Total tests: 2130706432 2130706432 Inexact results: 926539007 834159368 Estimate missed by 1 ULP: 862814916 796017331 Estimate missed by 2 ULP: 62179595 37665787 Estimate missed by 3 ULP: 1537746 476250 Estimate missed by 4 ULP: 6750 0 Estimate missed by >4 ULP: 0 0

As you can see, with the patch square roots are significantly more

accurate on average, though I don't have a good explanation for this.

I performed testing on a number of Intel microarchitectures (including

Atom) and got exactly the same results for all of them.

As for improved hardware SQRT efficiency in modern Intel CPUs, I'm

working on a patch that address this particular issue and I will

share the patch very soon.

Here are the results with vfmadd132ss:

est1 est2 Total tests: 2130706432 2130706432 Inexact results: 911966111 817657369 Estimate missed by 1 ULP: 854798052 785288244 Estimate missed by 2 ULP: 56073347 32117919 Estimate missed by 3 ULP: 1092044 251206 Estimate missed by 4 ULP: 2668 0 Estimate missed by >4 ULP: 0 0

Generally, using FMA improves precision. Again, the new code

sequence produces statistically better results.

Thanks, Nikolai!

I reproduced your results for both FMA and non-FMA on my local Haswell machine.

Here is the output from AMD Jaguar (no FMA):

est1 est2 Inexact results 828912065 694175396 Estimate missed by 1 ULP: 788646331 681356143 Estimate missed by 2 ULP: 39579384 12754737 Estimate missed by 3 ULP: 680779 64516 Estimate missed by 4 ULP: 5571 0 Estimate missed by >= 5 ULP with one N-R step = 0

So again, eliminating the extra multiply benefits accuracy in general. I attached my hacked tester program for this experiment to PR21385 in case anyone else wants to try it or adapt it to non-x86, but I'm assuming the accuracy improvement holds for any architecture because we're eliminating some intermediate error with the refactoring.

Please add a regression test to show the codegen when 2 or more N-R steps are used. PowerPC uses at least 2 refinement steps with 'frsqrte', so it may be easiest to add a sqrt (rather than rsqrt) test to test/CodeGen/PowerPC/recipest.ll instead of adding another RUN to the x86 test.

Please add a regression test to show the x86 codegen with FMA. This can get updated when you do the follow-up patch for isFsqrtCheap() for the faster sqrt/div FPUs in Broadwell, etc.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|

355–365 ↗ | (On Diff #60018) | It's a giant mess...but since you're updating these names, I think the recommended solution is to use the current naming convention for these functions - start with a lowercase letter. |

14651–14652 ↗ | (On Diff #60018) | 'else' goes on the same line as '}'. Consider clang-format of the whole patch; there may be other formatting changes needed. |

14696–14698 ↗ | (On Diff #60018) | No braces need here; see previous comment about using clang-format. |

Thanks for your remarks, Sanjay!

I've fixed formatting issues and uploaded a new version of the patch.

As for tests, I can easily add to `x86/sqrt-fastmath.ll` another RUN

with `-mattr=fma -recip=all:2` to check both FMA and two steps

refinement, but I'm not sure if it is a good idea. I believe

`sqrt-fastmath.ll` is overly fragile now. It can be easily broken by

perfectly valid variations in register allocation, code ordering or code

reassociation. And I'm afraid that additional tests for 2 steps

refinement will make the test even more fragile.

So, I wonder if it would be better to split SQRT testing logically in

two parts:

- check that NR is applied iff necessary (lit-tests)
- check that NR code calculates roots with expected accuracy (execution test)

In other words, my suggestion is to convert the accuracy testers from

PR21385 into new tests for LLVM test-suite instead of adding more

fragile lit-tests here. What do you think?

I think that adding an accuracy test to test-suite is a great idea. And I agree that these tests are susceptible to RA/scheduler changes, but the benefit of exact testing has usually outweighed the maintenance cost in my experience.

So I have another, hopefully better, test suggestion. :)

How about a MIR test? (cc'ing Matthias and Quentin for any MIR test suggestions)

The RUN line would include something like:

$ llc -o - -stop-after machine-scheduler rsqrt.ll -mattr=avx2,fma -recip=sqrtf

And then we can check the output closely for:

%1 = VRSQRTSSr undef %2, %0 %4 = VMULSSrr %1, %1 %4 = VFMADDSSr213m %4, %0, %rip, 1, _, %const.0, _ :: (load 4 from constant-pool) %5 = VMULSSrm %1, %rip, 1, _, %const.1, _ :: (load 4 from constant-pool) %11 = VMULSSrr %5, %0 %7 = VMULSSrr %4, %11 %8 = FsFLD0SS %9 = VCMPSSrr %0, %8, 0 %10 = VFsANDNPSrr %9, %7

This will be immune to RA and other machine-level variations.

Ok, I added tests to check MIR output after instruction selection for

X86 with two refinement steps and with FMA enabled. These seem to cover

all the cases not covered previously.