Page MenuHomePhabricator

Improve division estimation of floating points.
ClosedPublic

Authored by qiucf on Aug 10 2019, 3:40 AM.

Details

Summary

Current implementation of _fast_ division (A/B) is to:

  1. Get an initial estimation of reciprocal of B
  2. Use Newton's iteration method to improve the reciprocal
  3. Multiply the estimation with A

Compared with GCC, this loses some precision since multiplication is done after all iterations.

This patch is to do multiplication before the last iteration to make the result more accurate. It won't add/change any existing nodes/instructions except reordering calculation.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Missing test coverage; performance/precision overview - how much of performance do we loose for how much extra precision, comparison with native division?

ICC behaviour?

I suppose this is essentially an RFC to gauge the community's interest in improving the algorithm to gain more precision.
I think that in order to pull the trigger on a change such as this we would need the following to happen:

  1. The patch needs full context to be reviewable
  2. The formatting needs to be fixed up (some lines too long, etc.)
  3. We need testing (I imagine this makes a bunch of LIT tests fail which need to be updated)
  4. b) It might not be a bad idea to add a test to test-suite that will do some fast division vs. non-fast division to make sure the accuracy isn't too bad
  5. We should do some thorough analysis of the accuracy of the algorithm vs. the HW implementation on a wide range of values on a couple of targets (as was suggested above). I know this was done on PPC, so please share those results and try to replicate on another easily available target (such as X86).
qiucf added a comment.Aug 14 2019, 1:17 AM

Missing test coverage; performance/precision overview - how much of performance do we loose for how much extra precision, comparison with native division?

The patch won't affect the performance by either comparing running time or analyzing it in theory, since actually it just does a re-order for them, producing no more DAG nodes/instructions.

Compared with native division (tested with our own math library on PPC), the maximum error from estimation by GCC is 0.5ulp, while that of Clang is >4ulp.

ICC behaviour?

Do you mean the behavior is the same as ICC? I've not tested it, but the patch is to make result as precise as GCC.

qiucf added a comment.Aug 14 2019, 1:28 AM
  1. The patch needs full context to be reviewable
  2. The formatting needs to be fixed up (some lines too long, etc.)
  3. We need testing (I imagine this makes a bunch of LIT tests fail which need to be updated)

I'm updating the patch :) Broken LIT tests aren't many.

  1. It might not be a bad idea to add a test to test-suite that will do some fast division vs. non-fast division to make sure the accuracy isn't too bad

I'll add a portable case can be compiled and run at all platforms. That can be merged into our test-suite.

  1. We should do some thorough analysis of the accuracy of the algorithm vs. the HW implementation on a wide range of values on a couple of targets (as was suggested above). I know this was done on PPC, so please share those results and try to replicate on another easily available target (such as X86).

Yes. But actually only a few platforms have support for such estimation. Clang doesn't do it on X86 but GCC does. I'll add the author to reviewer list.

qiucf edited the summary of this revision. (Show Details)Aug 14 2019, 1:32 AM
qiucf added a reviewer: spatel.
qiucf updated this revision to Diff 215048.Aug 14 2019, 1:39 AM

Update patch format.

spatel added a subscriber: scanon.Aug 14 2019, 7:16 AM

x86 is moving away from reciprocal estimate code because recent hardware implements a real fdiv at about the same speed as reciprocal estimate+refinement. So x86 perf probably isn't that big of a concern, but it would be good to see the regression test diffs (use the auto-generation script to update those files).

I don't have a good sense of the accuracy gains/trade-offs from pulling the numerator into the estimate calc. Ping @scanon for better insight/data on that.

qiucf updated this revision to Diff 215542.Aug 16 2019, 2:08 AM

Fix broken LIT tests.

xbolva00 accepted this revision.Aug 16 2019, 4:18 AM

(Regressions)

llvm/test/CodeGen/PowerPC/combine-fneg.ll
19 ↗(On Diff #215542)

Regression

llvm/test/CodeGen/X86/recip-fastmath2.ll
602 ↗(On Diff #215542)

Extra instruction

This revision is now accepted and ready to land.Aug 16 2019, 4:18 AM
xbolva00 requested changes to this revision.Aug 16 2019, 4:19 AM
This revision now requires changes to proceed.Aug 16 2019, 4:19 AM
spatel added inline comments.
llvm/test/CodeGen/X86/recip-fastmath2.ll
602 ↗(On Diff #215542)

It is an extra instruction, but it's not an extra uop. We're loading the constant numerator into a register because it is used twice now.

My guess is that fdiv with a constant isn't the usual case. We should have tests with 2 variable operands (fdiv x, y).

That might show that we save an instruction/uop with this patch because we don't have to load the "1.0" constant.

@qiucf - please rebase and update for test changes after rL369106.

@qiucf - please rebase and update for test changes after rL369106.

Oops - make that rebase for after rL369107.

qiucf updated this revision to Diff 215779.Aug 18 2019, 9:48 AM

Rebase to fix regression.

xbolva00 resigned from this revision.Aug 18 2019, 9:55 AM
qiucf marked an inline comment as done.Aug 18 2019, 10:03 PM
qiucf added a comment.Aug 22 2019, 3:01 AM

I posted a simple test for fp division on GitHub.
This patch helps on the accuracy especially when compared with GCC. :)
@nemanjai @lebedev.ri

hfinkel added inline comments.Aug 22 2019, 5:55 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
20080 ↗(On Diff #215779)

"last time of iteration" -> "last iteration"

"try taking numerator into consideration" -> "also multiply by the numerator"

llvm/test/CodeGen/PowerPC/combine-fneg.ll
19 ↗(On Diff #215542)

There's still an extra arithmetic instruction here?

qiucf marked 3 inline comments as done.Aug 26 2019, 6:16 AM
qiucf added inline comments.
llvm/test/CodeGen/PowerPC/combine-fneg.ll
19 ↗(On Diff #215542)

It's as expected. Since visitFDIV calls combineRepeatedFPDivisors to transform the vector divisions before BuildDivEstimate.

But the extra instructions are really redundant here. If visitFMA folds (fma (fneg a) (fneg b) c) into (fma a b c) like what visitFMUL does, the extra instructions will get eliminated. This can be done in future patch.

qiucf updated this revision to Diff 217129.Aug 26 2019, 6:17 AM
qiucf marked an inline comment as done.

Fix typo and rebase.

spatel added inline comments.Aug 29 2019, 1:01 PM
llvm/test/CodeGen/PowerPC/combine-fneg.ll
19 ↗(On Diff #215542)

Did rL370071 solve that?

qiucf marked 2 inline comments as done.Sep 2 2019, 12:12 AM
qiucf added inline comments.
llvm/test/CodeGen/PowerPC/combine-fneg.ll
19 ↗(On Diff #215542)

Yes, there're no more instructions now.

qiucf updated this revision to Diff 218292.Sep 2 2019, 12:14 AM
qiucf marked an inline comment as done.

Update test to reflect changes introduced in rL370071.

spatel accepted this revision.Sep 2 2019, 6:40 AM

LGTM

This revision is now accepted and ready to land.Sep 2 2019, 6:40 AM
  1. b) It might not be a bad idea to add a test to test-suite that will do some fast division vs. non-fast division to make sure the accuracy isn't too bad
  2. We should do some thorough analysis of the accuracy of the algorithm vs. the HW implementation on a wide range of values on a couple of targets (as was suggested above). I know this was done on PPC, so please share those results and try to replicate on another easily available target (such as X86).

I think these two points weren't addressed.
I'd like to see at least some publicly-stated numbers on accuracy,
just so we all know this is going in the right direction for all inputs.

spatel requested changes to this revision.Sep 2 2019, 7:11 AM
  1. b) It might not be a bad idea to add a test to test-suite that will do some fast division vs. non-fast division to make sure the accuracy isn't too bad
  2. We should do some thorough analysis of the accuracy of the algorithm vs. the HW implementation on a wide range of values on a couple of targets (as was suggested above). I know this was done on PPC, so please share those results and try to replicate on another easily available target (such as X86).

I think these two points weren't addressed.
I'd like to see at least some publicly-stated numbers on accuracy,
just so we all know this is going in the right direction for all inputs.

Changing my 'accepted' until this is answered.

The test at:
https://github.com/ecnelises/fp-division-test/
...seems to do a small random sampling.

The original transform was tested on x86 using brute force for all possible floats (1.0f/x) and is attached here:
https://bugs.llvm.org/show_bug.cgi?id=21385

I'm not sure how to prove this, but by distributing the multiplication into the last step of the estimate, I think we are always trading better accuracy around the numerator value with potentially overflowing to infinity for extremely different numerator/denominator. That's a good trade-off IMO and within the loosely-defined behavior enabled by 'arcp' in LLVM and '-mrecip' with Clang.

This revision now requires changes to proceed.Sep 2 2019, 7:11 AM
qiucf added a comment.Sep 3 2019, 12:09 AM

I think these two points weren't addressed.
I'd like to see at least some publicly-stated numbers on accuracy,
just so we all know this is going in the right direction for all inputs.

Changing my 'accepted' until this is answered.

The test at:
https://github.com/ecnelises/fp-division-test/
...seems to do a small random sampling.

The original transform was tested on x86 using brute force for all possible floats (1.0f/x) and is attached here:
https://bugs.llvm.org/show_bug.cgi?id=21385

I'm not sure how to prove this, but by distributing the multiplication into the last step of the estimate, I think we are always trading better accuracy around the numerator value with potentially overflowing to infinity for extremely different numerator/denominator. That's a good trade-off IMO and within the loosely-defined behavior enabled by 'arcp' in LLVM and '-mrecip' with Clang.

Thanks for test case in PR21385. I'll write tests on a wider range of numbers. We, from my point of view, need two kind of tests:

  • A compiler-independent program showing _distributing the multiplication into the last step of estimation_ is really more accurate. It shoule be just like the case you showed.
  • A program with functions optimized at different level (e.g. -Ofast and -O3) comparing results of them with real divisions. This can originate from my previous fp-division-test. I think this is suitable for test suites.

Result of test should include:

  • Accuracy (< 2ulp?) rate compared with real divisions.
  • Accuracy rate compared with current implementation.
  • Accuracy rate compared with other implementations, such as GCC.

A problem here: iterate from 0x00800000 to 0x7E800000 is acceptable for testing reciprocals, but not for testing divisions (n^2). I'm not sure changing iteration step from 1 to 10, 100 or larger to reduce running time is okay.

qiucf added a comment.EditedSep 5 2019, 1:02 AM

I updated a test for testing this new way of division estimations. It's posted at https://github.com/ecnelises/fp-division-test/blob/master/algorithm_test.c so people can do test by their own. Here are my accuracy results:

(Here OLD means do iteration once and multiply)

  • OLD: 79.47% (7600728294/9563620427), NEW: 99.92% (9555976854/9563620427) with enumeration steps 21277, 21961
  • OLD: 79.51% (17982192012/22615809720), NEW: 99.92% (22597759561/22615809720) with enumeration steps 11587, 17053
  • OLD: 79.52% (72788895710/91532258037), NEW: 99.92% (73098710/91532258037) with enumeration steps 6037, 8087
  • OLD: 79.50% (175651525048/220935684644), NEW: 99.92% (220759033938/220935684644), with enumeration steps 5471, 3697
  • OLD: 79.53% (1807487453401/2272767717744), NEW: 99.92% (2270951449072/2272767717744), with enumeration steps 1523, 1291
  • OLD: 79.52% (4700786248799/5911547429304), NEW: 99.92% (5906822137336/5911547429304), with enumeration steps 983, 769

Inaccurate stands for that error is larger than 0.5ulp. Also I have some other sources in that repository which can be used to test division accuracy after Ofast. I think they can be good test suite candidate.

Results here are quite stable to show that the new way has better accuracy.

qiucf updated this revision to Diff 219291.Sep 9 2019, 1:17 AM

Update patch to fix check regressions from recent commits.

I updated a test for testing this new way of division estimations. It's posted at https://github.com/ecnelises/fp-division-test/blob/master/algorithm_test.c so people can do test by their own. Here are my accuracy results:

(Here OLD means do iteration once and multiply)

  • OLD: 79.47% (7600728294/9563620427), NEW: 99.92% (9555976854/9563620427) with enumeration steps 21277, 21961
  • OLD: 79.51% (17982192012/22615809720), NEW: 99.92% (22597759561/22615809720) with enumeration steps 11587, 17053
  • OLD: 79.52% (72788895710/91532258037), NEW: 99.92% (73098710/91532258037) with enumeration steps 6037, 8087
  • OLD: 79.50% (175651525048/220935684644), NEW: 99.92% (220759033938/220935684644), with enumeration steps 5471, 3697
  • OLD: 79.53% (1807487453401/2272767717744), NEW: 99.92% (2270951449072/2272767717744), with enumeration steps 1523, 1291
  • OLD: 79.52% (4700786248799/5911547429304), NEW: 99.92% (5906822137336/5911547429304), with enumeration steps 983, 769

    Inaccurate stands for that error is larger than 0.5ulp. Also I have some other sources in that repository which can be used to test division accuracy after Ofast. I think they can be good test suite candidate.

    Results here are quite stable to show that the new way has better accuracy.

Thank you for posting these.
I don't know if the step is fine-grained enough, but these numbers look good.

LGTM for x86. I tried the test program on Haswell and see the expected improvements (better results with FMA as also expected).
cc'ing some potential AMDGPU reviewers.

AMDGPU test changes looks good.

jsji requested changes to this revision.Sep 9 2019, 2:56 PM

Please include *Full Context* in newer diff . Thanks.

This revision now requires changes to proceed.Sep 9 2019, 2:56 PM
qiucf updated this revision to Diff 219469.Sep 9 2019, 8:05 PM

Upload patch with full context.

qiucf updated this revision to Diff 219475.Sep 9 2019, 9:08 PM

Remove unexpected changed file caused by newline characters.

It would appear that we are converging here. Both @jsji and @spatel are presumably in the "Must Review" for this. Would you guys mind taking another look as your approval is necessary for this to proceed.

spatel accepted this revision.Sep 11 2019, 5:38 AM

LGTM

jsji accepted this revision.Sep 11 2019, 8:19 AM

LGTM.

This revision is now accepted and ready to land.Sep 11 2019, 8:19 AM
This revision was automatically updated to reflect the committed changes.