This is an archive of the discontinued LLVM Phabricator instance.

Remove redundant FMUL in Newton-Raphson SQRT code
ClosedPublic

Authored by n.bozhenov on Jun 8 2016, 5:36 AM.

Details

Summary

When calculating a square root using Newton-Raphson with two constants,
a naive implementation is to use five multiplications (four muls to calculate
reciprocal square root and another one to calculate the square root itself).
However, after some reassociation and CSE the same result can be obtained
with only four multiplications. Unfortunately, there's no reliable way to do
such a reassociation in the back-end. So, the patch modifies NR code itself
so that it directly builds optimal code for SQRT and doesn't rely on any
further reassociation.

Diff Detail

Event Timeline

n.bozhenov updated this revision to Diff 60018.Jun 8 2016, 5:36 AM
n.bozhenov retitled this revision from to Remove redundant FMUL in Newton-Raphson SQRT code.
n.bozhenov updated this object.
n.bozhenov added a subscriber: llvm-commits.
spatel edited edge metadata.Jun 8 2016, 9:07 AM

This seems like a good perf optimization to me in general, but a few high-level questions:

  1. How does the refactoring of the multiplication affect the accuracy of the results? I attached some test programs that could be modified to answer this in https://llvm.org/bugs/show_bug.cgi?id=21385 .
  1. The case of converting a sqrt into an estimate sequence is a problem in https://llvm.org/bugs/show_bug.cgi?id=24063 . Would this patch sidestep that bug by producing the more accurate/expected result for that case?
  1. Given that we have 3 generations of Intel FMA FPUs now (and the non-x86 world has always had FMA), it would be interesting to know the accuracy (and codegen tests should probably be added) for the FMA variant. It would also be good to add tests with 2 refinement steps, so we know this patch is behaving as expected in that case, but...
  1. (Apologies for straying from the immediate goal of the patch...) Given that Intel FPUs in Broadwell and Skylake have reduced IEEE-compliant div/sqrt to 4 and then 3 cycle throughputs according to Agner's tables, using an estimate is likely a perf misoptimization for current and future Intel big cores. I think that we can fix this using the existing hook 'isFsqrtCheap()', but please confirm that this patch preserves that opportunity.
RKSimon added a subscriber: RKSimon.Jun 8 2016, 2:34 PM

Thanks for great questions, Sanjay!

I slightly modified the example from PR21385 and compared these two
sequences to calculate square roots (the current one and the patched one):

est1 = (-0.5f * est0) * (-3.0f + est0 * est0 * f) * f;
float ae = est0 * f;
est2 = (-0.5f * ae) * (-3.0f + est0 * ae);

And I obtained the following results:

                                   est1         est2
Total tests:                 2130706432   2130706432
Inexact results:              926539007    834159368
Estimate missed by  1 ULP:    862814916    796017331
Estimate missed by  2 ULP:     62179595     37665787
Estimate missed by  3 ULP:      1537746       476250
Estimate missed by  4 ULP:         6750            0
Estimate missed by >4 ULP:            0            0

As you can see, with the patch square roots are significantly more
accurate on average, though I don't have a good explanation for this.
I performed testing on a number of Intel microarchitectures (including
Atom) and got exactly the same results for all of them.

As for improved hardware SQRT efficiency in modern Intel CPUs, I'm
working on a patch that address this particular issue and I will
share the patch very soon.

Here are the results with vfmadd132ss:

                                   est1         est2
Total tests:                 2130706432   2130706432
Inexact results:              911966111    817657369
Estimate missed by  1 ULP:    854798052    785288244
Estimate missed by  2 ULP:     56073347     32117919
Estimate missed by  3 ULP:      1092044       251206
Estimate missed by  4 ULP:         2668            0
Estimate missed by >4 ULP:            0            0

Generally, using FMA improves precision. Again, the new code
sequence produces statistically better results.

spatel added a comment.Jun 9 2016, 7:41 AM

Thanks, Nikolai!
I reproduced your results for both FMA and non-FMA on my local Haswell machine.
Here is the output from AMD Jaguar (no FMA):

                                         est1        est2
Inexact results                     828912065   694175396
 Estimate missed by 1 ULP:          788646331   681356143
 Estimate missed by 2 ULP:           39579384    12754737
 Estimate missed by 3 ULP:             680779       64516
 Estimate missed by 4 ULP:               5571           0
 Estimate missed by >= 5 ULP with one N-R step = 0

So again, eliminating the extra multiply benefits accuracy in general. I attached my hacked tester program for this experiment to PR21385 in case anyone else wants to try it or adapt it to non-x86, but I'm assuming the accuracy improvement holds for any architecture because we're eliminating some intermediate error with the refactoring.

spatel added a comment.Jun 9 2016, 8:39 AM

Please add a regression test to show the codegen when 2 or more N-R steps are used. PowerPC uses at least 2 refinement steps with 'frsqrte', so it may be easiest to add a sqrt (rather than rsqrt) test to test/CodeGen/PowerPC/recipest.ll instead of adding another RUN to the x86 test.

Please add a regression test to show the x86 codegen with FMA. This can get updated when you do the follow-up patch for isFsqrtCheap() for the faster sqrt/div FPUs in Broadwell, etc.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
355–365

It's a giant mess...but since you're updating these names, I think the recommended solution is to use the current naming convention for these functions - start with a lowercase letter.

14651–14652

'else' goes on the same line as '}'. Consider clang-format of the whole patch; there may be other formatting changes needed.

14696–14698

No braces need here; see previous comment about using clang-format.

n.bozhenov updated this revision to Diff 60392.Jun 10 2016, 1:41 PM
n.bozhenov edited edge metadata.

Thanks for your remarks, Sanjay!
I've fixed formatting issues and uploaded a new version of the patch.

As for tests, I can easily add to x86/sqrt-fastmath.ll another RUN
with -mattr=fma -recip=all:2 to check both FMA and two steps
refinement, but I'm not sure if it is a good idea. I believe
sqrt-fastmath.ll is overly fragile now. It can be easily broken by
perfectly valid variations in register allocation, code ordering or code
reassociation. And I'm afraid that additional tests for 2 steps
refinement will make the test even more fragile.

So, I wonder if it would be better to split SQRT testing logically in
two parts:

  1. check that NR is applied iff necessary (lit-tests)
  2. check that NR code calculates roots with expected accuracy (execution test)

In other words, my suggestion is to convert the accuracy testers from
PR21385 into new tests for LLVM test-suite instead of adding more
fragile lit-tests here. What do you think?

n.bozhenov updated this revision to Diff 60402.Jun 10 2016, 2:02 PM
n.bozhenov marked an inline comment as done.

Sorry, last time I uploaded the wrong diff. Here's the correct diff.

n.bozhenov marked 2 inline comments as done.Jun 10 2016, 2:04 PM

As for tests, I can easily add to x86/sqrt-fastmath.ll another RUN
with -mattr=fma -recip=all:2 to check both FMA and two steps
refinement, but I'm not sure if it is a good idea. I believe
sqrt-fastmath.ll is overly fragile now. It can be easily broken by
perfectly valid variations in register allocation, code ordering or code
reassociation. And I'm afraid that additional tests for 2 steps
refinement will make the test even more fragile.

So, I wonder if it would be better to split SQRT testing logically in
two parts:

  1. check that NR is applied iff necessary (lit-tests)
  2. check that NR code calculates roots with expected accuracy (execution test)

In other words, my suggestion is to convert the accuracy testers from
PR21385 into new tests for LLVM test-suite instead of adding more
fragile lit-tests here. What do you think?

I think that adding an accuracy test to test-suite is a great idea. And I agree that these tests are susceptible to RA/scheduler changes, but the benefit of exact testing has usually outweighed the maintenance cost in my experience.

So I have another, hopefully better, test suggestion. :)

How about a MIR test? (cc'ing Matthias and Quentin for any MIR test suggestions)

The RUN line would include something like:
$ llc -o - -stop-after machine-scheduler rsqrt.ll -mattr=avx2,fma -recip=sqrtf

And then we can check the output closely for:

%1 = VRSQRTSSr undef %2, %0
%4 = VMULSSrr %1, %1
%4 = VFMADDSSr213m %4, %0, %rip, 1, _, %const.0, _ :: (load 4 from constant-pool)
%5 = VMULSSrm %1, %rip, 1, _, %const.1, _ :: (load 4 from constant-pool)
%11 = VMULSSrr %5, %0
%7 = VMULSSrr %4, %11
%8 = FsFLD0SS
%9 = VCMPSSrr %0, %8, 0
%10 = VFsANDNPSrr %9, %7

This will be immune to RA and other machine-level variations.

n.bozhenov updated this revision to Diff 60661.Jun 14 2016, 4:14 AM

Ok, I added tests to check MIR output after instruction selection for
X86 with two refinement steps and with FMA enabled. These seem to cover
all the cases not covered previously.

spatel accepted this revision.Jun 14 2016, 8:44 AM
spatel edited edge metadata.

LGTM.

This revision is now accepted and ready to land.Jun 14 2016, 8:44 AM
zinovy.nis accepted this revision.Jun 16 2016, 3:00 AM
zinovy.nis edited edge metadata.
This revision was automatically updated to reflect the committed changes.