This is an archive of the discontinued LLVM Phabricator instance.

[libc] Implement correct rounding with all rounding modes for hypot functions.
ClosedPublic

Authored by lntue on Jan 18 2022, 11:12 AM.

Details

Summary

Update the rounding logic for generic hypot function so that it will round correctly with all rounding modes.

Diff Detail

Event Timeline

lntue created this revision.Jan 18 2022, 11:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 18 2022, 11:12 AM
lntue requested review of this revision.Jan 18 2022, 11:12 AM

LGTM from a formatting perspective

Dear Tue,

Update the rounding logic for generic hypot function so that it will round correctly with all rounding modes.

I did stress this revision with my random testing code
(https://gitlab.inria.fr/zimmerma/math_accuracy) and did not find any
incorrectly rounded results, for all rounding modes, both for binary32
and binary64.

Instead of testing get_round() == FE_UPWARD to know the current rounding
mode, I wonder whether a test like 0x1p0f + 0x1p-24f != 0x1p0f would be
faster.

Paul

lntue added a comment.Jan 19 2022, 7:01 AM

Dear Tue,

Update the rounding logic for generic hypot function so that it will round correctly with all rounding modes.

I did stress this revision with my random testing code
(https://gitlab.inria.fr/zimmerma/math_accuracy) and did not find any
incorrectly rounded results, for all rounding modes, both for binary32
and binary64.

Instead of testing get_round() == FE_UPWARD to know the current rounding
mode, I wonder whether a test like 0x1p0f + 0x1p-24f != 0x1p0f would be
faster.

Paul

Thanks Paul for testing the patch!

I've tried with using the 0x1p0f + 0x1p-24f != 0x1p0f instead of get_round() == FE_UPWARD, and it does make the perf tests on normal range ~ 5% faster.
But this is due to the compiler optimized away the expression (making it always False), and in turn, making the function not correctly rounded for all rounding modes any more: https://godbolt.org/z/87z4bWE9P

And if feel like if we add extra stuff to prevent the compiler from optimizing the expression away, it would bring the performance back to what we got with get_round() == FE_UPWARD.
This is also belong to the exceptional cases where we short-circuit the results, and so at least any changes inside would not affect the worst case performance.

Dear Tue,

I've tried with using the 0x1p0f + 0x1p-24f != 0x1p0f instead of get_round() == FE_UPWARD, and it does make the perf tests on normal range ~ 5% faster.
But this is due to the compiler optimized away the expression (making it always False), and in turn, making the function not correctly rounded for all rounding modes any more: https://godbolt.org/z/87z4bWE9P

yes, you need to add -frounding-math to the compiler options.

And if feel like if we add extra stuff to prevent the compiler from optimizing the expression away, it would bring the performance back to what we got with get_round() == FE_UPWARD.
This is also belong to the exceptional cases where we short-circuit the results, and so at least any changes inside would not affect the worst case performance.

please can you try with -frounding-math and check if the perf tests are
faster and slower?

Paul

lntue updated this revision to Diff 401251.Jan 19 2022, 8:19 AM

Use a faster check for FE_UPWARD rounding mode.

lntue added a comment.Jan 19 2022, 8:21 AM

Dear Tue,

I've tried with using the 0x1p0f + 0x1p-24f != 0x1p0f instead of get_round() == FE_UPWARD, and it does make the perf tests on normal range ~ 5% faster.
But this is due to the compiler optimized away the expression (making it always False), and in turn, making the function not correctly rounded for all rounding modes any more: https://godbolt.org/z/87z4bWE9P

yes, you need to add -frounding-math to the compiler options.

And if feel like if we add extra stuff to prevent the compiler from optimizing the expression away, it would bring the performance back to what we got with get_round() == FE_UPWARD.
This is also belong to the exceptional cases where we short-circuit the results, and so at least any changes inside would not affect the worst case performance.

please can you try with -frounding-math and check if the perf tests are
faster and slower?

Paul

Thanks Paul! I've added -fround-math and it worked correctly while maintaining 5% overall performance improvement compared to get_round().

I've updated the patch accordingly. You might have to redo the stress tests again to make sure everything is still working properly.

Thanks!

lntue updated this revision to Diff 401323.Jan 19 2022, 10:18 AM

Remove warnings for perf tests.

sivachandra added inline comments.Jan 19 2022, 10:39 AM
libc/test/src/math/HypotTest.h
66

Is the call to func happening with the intended rounding mode?

lntue added inline comments.Jan 19 2022, 11:03 AM
libc/test/src/math/HypotTest.h
66

Yes, by putting the call inside the macro, it will be put by the macro after the rounding mode is set before it is evaluated. If it does not work as intended, the assertions with different rounding modes will definitely catch it, because most of the outputs are not representable in the floating point, and hence rounding up/down will guarantee to give different results.

sivachandra accepted this revision.Jan 19 2022, 11:07 AM

Please wait for @zimmermann6 to give his green light.

This revision is now accepted and ready to land.Jan 19 2022, 11:07 AM

I still get warnings with the latest revision:

/localdisk/zimmerma/llvm-project/libc/src/__support/FPUtil/Hypot.h:149:22: warning: hexadecimal floating literals are a C++17 feature [-Wc++17-extensions]
    if ((y != 0) && (0x1p0f + 0x1p-24f != 0x1p0f)) {
                     ^
zimmermann6 accepted this revision.Jan 19 2022, 11:23 PM

the stress tests were successful (for all four rounding modes, both in single and double precision).
Thus I am ok with this version, thanks!

zimmermann6 requested changes to this revision.Jan 20 2022, 2:52 AM

after fixing my stress program I was able to find one value which does not seem to be correctly rounded (for binary32 and rounding up):

zimmerma@biscotte:~/svn/tbd/20/src/binary32$ CFLAGS=-DCHECK_CR LLVM=llvm-project VERBOSE=-v RND=rndu ./doitb.llvm hypot 1000
Checking hypot with llvm-project and rndu
Using seed 1076573
NEW hypot 0 -1 0x1.ffffecp-1,-0x1.000002p+27 [1.00] 1 1
libm gives 0x1.000002p+27
mpfr gives 0x1.000004p+27

Please can you confirm?

This revision now requires changes to proceed.Jan 20 2022, 2:52 AM

I got similar results with binary64:

Checking hypot with llvm-project and rndu
Using seed 1078001
NEW hypot 0 -1 0x1.ccbbbcfef3c02p-523,0x1.924bf639c1a94p+500 [1.00] 1 1
libm gives 0x1.924bf639c1a94p+500
mpfr gives 0x1.924bf639c1a95p+500
lntue updated this revision to Diff 401620.Jan 20 2022, 6:39 AM

Fix compiler flags that were added to the wrong entrypoints.

lntue added a comment.Jan 20 2022, 6:42 AM

after fixing my stress program I was able to find one value which does not seem to be correctly rounded (for binary32 and rounding up):

zimmerma@biscotte:~/svn/tbd/20/src/binary32$ CFLAGS=-DCHECK_CR LLVM=llvm-project VERBOSE=-v RND=rndu ./doitb.llvm hypot 1000
Checking hypot with llvm-project and rndu
Using seed 1076573
NEW hypot 0 -1 0x1.ffffecp-1,-0x1.000002p+27 [1.00] 1 1
libm gives 0x1.000002p+27
mpfr gives 0x1.000004p+27

Please can you confirm?

Thanks Paul for catching the problem! For some reason before my previous patch update, merging to the head had put the -fround-math compiler flags into the wrong entrypoints that I didn't notice. After moving them back to hypotf and hypot, the results are correct on my side now.

zimmermann6 accepted this revision.Jan 20 2022, 7:31 AM

I'm ok with the new revision. However I see there are still some calls to get_round(). Did you try to replace them by floating-point operations?

You might also want to add the following hard-to-round cases (for binary32) in your test cases:

/* the following are hard-to-round cases with many identical bits after       
   the round bit */
{0x1.900004p+34,0x1.400002p+23}, /* 45 identical bits */
{0x1.05555p+34,0x1.bffffep+23},  /* 44 identical bits */
{0x1.e5fffap+34,0x1.affffep+23}, /* 45 identical bits */
{0x1.260002p+34,0x1.500002p+23}, /* 45 identical bits */
{0x1.fffffap+34,0x1.fffffep+23}, /* 45 identical bits */
{0x1.8ffffap+34,0x1.3ffffep+23}, /* 45 identical bits */
{0x1.87fffcp+35,0x1.bffffep+23}, /* 47 identical bits */

By the way, none of the other libraries is correctly rounded for binary32, here are the corresponding worst cases:

/* hypot(x,y) */
{0x1.b8e50ap-52,-0x1.db1e78p-64},   /* GNU libc       0.500001 */
{0x1.03b54cp-33,0x1.6ca6bep-45},    /* icc            0.500001 */
{0x1.e2eff6p+97,-0x1.044cb2p+108},  /* AMD LibM       0.500001 */
{-0x1.6b05c4p-127,0x1.6b3146p-126}, /* Newlib         1.20805 */
{-0x1.6b05c4p-127,0x1.6b3146p-126}, /* OpenLibm       1.20805 */
{0x1.26b188p-127,-0x1.a4f2fp-128},  /* Musl           0.926707 */
{0x1.e2eff6p+97,-0x1.044cb2p+108},  /* Darwin 20.4.0  0.500001 */
This revision is now accepted and ready to land.Jan 20 2022, 7:31 AM


attached is a file with 1200 binary32 exact cases with ulp(x)=2^12*ulp(y), x^2+y^2=z^2 having up to 72 bits. You might add them to your test cases.

lntue updated this revision to Diff 401666.Jan 20 2022, 9:14 AM

Add more hard-to-round tests.

lntue updated this revision to Diff 401673.Jan 20 2022, 9:40 AM

Add even more hard-to-round tests from Paul.

lntue added a comment.Jan 20 2022, 9:46 AM

I'm ok with the new revision. However I see there are still some calls to get_round(). Did you try to replace them by floating-point operations?

You might also want to add the following hard-to-round cases (for binary32) in your test cases:

/* the following are hard-to-round cases with many identical bits after       
   the round bit */
{0x1.900004p+34,0x1.400002p+23}, /* 45 identical bits */
{0x1.05555p+34,0x1.bffffep+23},  /* 44 identical bits */
{0x1.e5fffap+34,0x1.affffep+23}, /* 45 identical bits */
{0x1.260002p+34,0x1.500002p+23}, /* 45 identical bits */
{0x1.fffffap+34,0x1.fffffep+23}, /* 45 identical bits */
{0x1.8ffffap+34,0x1.3ffffep+23}, /* 45 identical bits */
{0x1.87fffcp+35,0x1.bffffep+23}, /* 47 identical bits */

By the way, none of the other libraries is correctly rounded for binary32, here are the corresponding worst cases:

/* hypot(x,y) */
{0x1.b8e50ap-52,-0x1.db1e78p-64},   /* GNU libc       0.500001 */
{0x1.03b54cp-33,0x1.6ca6bep-45},    /* icc            0.500001 */
{0x1.e2eff6p+97,-0x1.044cb2p+108},  /* AMD LibM       0.500001 */
{-0x1.6b05c4p-127,0x1.6b3146p-126}, /* Newlib         1.20805 */
{-0x1.6b05c4p-127,0x1.6b3146p-126}, /* OpenLibm       1.20805 */
{0x1.26b188p-127,-0x1.a4f2fp-128},  /* Musl           0.926707 */
{0x1.e2eff6p+97,-0x1.044cb2p+108},  /* Darwin 20.4.0  0.500001 */

Thanks Paul for the test cases! I've added these and the other 1.2k cases in your attachment to the test.

I have also tried replacing the final get_round with the corresponding floating point operations, but for some reason, -O3 always partially optimized it away even with -frounding-math flag together, and that make the results incorrect.
If I change the flag to -O2, then it is correct, but I see virtually no performance improvement. So I decided to keep the get_round() at the end so that it is easier to read and less sensitive to optimization flags.

Dear Tue,

I have also tried replacing the final get_round with the corresponding floating point operations, but for some reason, -O3 always partially optimized it away even with -frounding-math flag together, and that make the results incorrect.
If I change the flag to -O2, then it is correct, but I see virtually no performance improvement. So I decided to keep the get_round() at the end so that it is easier to read and less sensitive to optimization flags.

this might be a compiler bug, since -frounding-math should be supported
whatever the optimization level (-O2 or -O3). You might want to investigate
further and report the bug if any.

Best regards,
Paul

qiucf added a subscriber: qiucf.Jan 25 2022, 5:02 AM

this might be a compiler bug, since -frounding-math should be supported whatever the optimization level (-O2 or -O3). You might want to investigate further and report the bug if any.

Hi, what's the target in the context? Or can you show the partial IR after -frounding-math? Currently only X86/SystemZ/PowerPC in LLVM has rather complete support for rounding-mode-aware float operations.

lntue added a comment.Jan 25 2022, 6:28 AM

this might be a compiler bug, since -frounding-math should be supported whatever the optimization level (-O2 or -O3). You might want to investigate further and report the bug if any.

Hi, what's the target in the context? Or can you show the partial IR after -frounding-math? Currently only X86/SystemZ/PowerPC in LLVM has rather complete support for rounding-mode-aware float operations.

This is the error messages that we got on aarch64-ubuntu: Buildbot log

[1/3] Building CXX object projects/libc/src/math/generic/CMakeFiles/libc.src.math.generic.hypotf.dir/hypotf.cpp.o
clang-9: warning: optimization flag '-frounding-math' is not supported [-Wignored-optimization-argument]
[2/3] Building CXX object projects/libc/src/math/generic/CMakeFiles/libc.src.math.generic.hypot.dir/hypot.cpp.o
clang-9: warning: optimization flag '-frounding-math' is not supported [-Wignored-optimization-argument]

It looks like our buildbot for aarch64 uses clang-9 and -frounding-math is not supported there?

This is the error messages that we got on aarch64-ubuntu: Buildbot log https://lab.llvm.org/buildbot/#/builders/138/builds/16983/steps/4/logs/stdio

it would be better to print the inputs with %e, %g or %a in the checking code,
since with %f tiny inputs simply appear as 0:

Input decimal: x: 0.00000000000000000000000000000000000000000000000000 y: 1797693134862315708145274237317043567980705675258449965989174768031572607800285387605895586327668781715404589535143824642343213268894641827684675467035375169860499105765512820762454900903893289440758