This is an archive of the discontinued LLVM Phabricator instance.

[Fortran] Relax relative tolerance for FCVS tests
ClosedPublic

Authored by rovka on Jun 21 2022, 5:16 AM.

Details

Summary

In one of the tests from FM905.f, flang and gfortran disagree about how
0.25e-10 should be printed: flang prints 2.5e-11, whereas gfortran
prints 2.50000003E-11 (*). Currently, the reference output contains the
former, and the relative tolerance is low enough that the latter is being
reported as an error. However, both values are reasonable results, so
this patch relaxes the tolerance for the FCVS tests to permit either value
when FCVS_ALLOW_FLEXIBLE_OUTPUT is enabled.

The value is implicitly typed in the source code, so it should use the
default real type (which can be either single or double precision
depending on the compilation flags).

In single precision, the closest representable value to 2.5e-11 is
2.500000033378579900045224349014461040496826171875E-11, which is the
same as the closest representable value for 2.50000003E-11, so both
outputs are valid.

In double precision, the closest value to 2.5e-11 is
2.50000000000000009108049328874435394791386766399909902247600257396697998046875E-11,
whereas the closest value to 2.50000003E-11 is
2.5000000300000001083984497912273228283075443556526806787587702274322509765625E-11.

Since fpcmp treats everything as double precision, it errors out in
gfortran's case because the reference output contains the minimum
number of decimals that would produce the same single precision
value. It also fails for either compiler when -fdefault-real-8 is passed:

test-suite/tools/fpcmp-target:
Compared: 2.500000e-11 and 2.500000e-11
abs. diff = 3.000000e-19 rel.diff = 1.200000e-08
Out of tolerance: rel/abs: 1.000000e-08/0.000000e+00

With the new tolerance, both outputs are considered valid.

(*) These results were obtained on an AArch64 Ubuntu machine.

Diff Detail

Event Timeline

rovka created this revision.Jun 21 2022, 5:16 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 21 2022, 5:16 AM
rovka requested review of this revision.Jun 21 2022, 5:16 AM

What you call the "true value" of 2.5E-11 cannot be represented exactly as a binary floating-point number. The closest real(4) value is exactly 2.500000033378579900045224349014461040496826171875E-11 and the closest real(8) value is 2.50000000000000009108049328874435394791386766399909902247600257396697998046875E-11.

The closest real(4) value of 2.50000003E-11 is the same as the closest real(4) value of 2.5E-11, but the closest real(8) value of 2.50000003E-11 is 2.5000000300000001083984497912273228283075443556526806787587702274322509765625E-11, which is not the same as the closest real(8) value of 2.5E-11.

For E0/D0/G0, list-directed, and NAMELIST output editing, f18 emits minimal digit sequences that, if read back into a real variable of the same kind, will reproduce the output value. Perhaps that is affecting this test. "2.5E-11" would be valid E0/list-directed output for the real(4) value closest to 2.5E-11, but not the real(8) value.

I suspect that your fpcmp tool is reading 2.5E-11 and 2.50000003E-11 in as double-precision values, and is rightfully complaining about their difference, which would indeed indicate an error if this output were of double precision values. Please verify that this is the case.

rovka edited the summary of this revision. (Show Details)Jun 22 2022, 3:01 AM
rovka added a comment.Jun 22 2022, 3:09 AM

Thanks for the explanation! I have updated the patch description with all the information. fpcmp does treat everything as double precision.

rovka updated this revision to Diff 439355.Jun 23 2022, 5:35 AM

Relax the tolerance only for FM905 and FM907 (which are already special-cased in D128260).

awarzynski added a comment.EditedJun 23 2022, 5:52 AM

Thanks for updating this!

Thanks for the explanation! I have updated the patch description with all the information. fpcmp does treat everything as double precision.

This makes sense to me. And indeed, fpcmp uses double for all floating point values.

relaxes the tolerance for the FCVS tests to permit both values, since they are both close enough (the absolute difference is small)

Yeah, the absolute difference is negligible. Note that FP_TOLERANCE is used as the relative tolerance when passed to fpcmp (see fpcmp invocation here and the help text from fpcmp documents its arguments). I think that 1e-7 as relative tolerance gives enough precision here. 1e-11 was too aggressive IMO - perhaps that was meant as absolute tolerance (which, btw, I don't see being specified)?

rovka added a comment.Jun 29 2022, 1:04 AM

Thanks for updating this!

Thanks for the explanation! I have updated the patch description with all the information. fpcmp does treat everything as double precision.

This makes sense to me. And indeed, fpcmp uses double for all floating point values.

relaxes the tolerance for the FCVS tests to permit both values, since they are both close enough (the absolute difference is small)

Yeah, the absolute difference is negligible. Note that FP_TOLERANCE is used as the relative tolerance when passed to fpcmp (see fpcmp invocation here and the help text from fpcmp documents its arguments). I think that 1e-7 as relative tolerance gives enough precision here. 1e-11 was too aggressive IMO - perhaps that was meant as absolute tolerance (which, btw, I don't see being specified)?

To be honest I haven't pondered all the nuances of absolute tolerance vs relative tolerance too much. I'm not a FP expert, so I tried to get away with what seemed to me to be the least aggressive change :)

rovka updated this revision to Diff 440889.Jun 29 2022, 1:06 AM
rovka set the repository for this revision to rT test-suite.

Rebase.

This change makes sense to me, thanks! Could you also update the summary? There's quite a lot of detail there, which can be skipped. Also, could you highlight that you are updating the relative tolerance here?

Fortran/UnitTests/fcvs21_f95/CMakeLists.txt
43

Could you add a comment saying that this is the relative tolerance? I think that this is key here.

rovka updated this revision to Diff 441314.Jun 30 2022, 2:12 AM
rovka retitled this revision from [Fortran] Relax tolerance for FCVS tests to [Fortran] Relax relative tolerance for FCVS tests.
rovka edited the summary of this revision. (Show Details)

Update summary & comment.

rovka set the repository for this revision to rT test-suite.Jun 30 2022, 5:42 AM

so this patch does 2 things:

🤔 I see only one change :)

Some general remarks about floating point:

  • Floating point formatters have different approached they output. One approach is to emit the decimal representation that is closest to the number represented by the IEEE bit pattern. Because this can lead to a lot of trailing digits in scientific notations for what should be "0,1" exactly, a more recent approach is to find the decimal representations with the fewest number of digits that would still be parsed back to the same IEEE bit pattern (as @rovka mentions, but the implementation doesn't seem to do that). So it is possible that 2.50000003e-11 is the result of the first approach, where 2.5e-11 follows the second approach (assuming single precision).
  • fpcmp was not written with whether two string representations parse to the same IEEE bit pattern in mind. Only to account for the general fuzziness that comes with floating point processing, including effects through -ffast-math, transcendental functions that only compute approximations, during string formatting, etc.. Considering that single only has ~7.2 significant mantissa bits, a relative tolerance of 1.0e-11 effectively only allows exact matches.
  • I think the tests should indeed consider the differences in floating point formatting as valid, and if single precision is a valid implementation choice for a Fortran compiler, increasing the tolerance seems right to me. I'd even go as far to increase it for all tests since that kind of approximation could occur with the other tests as well.

so this patch does 2 things:

🤔 I see only one change :)

A previous version also updated the reference output to "2.5e-11" as well, which you get you run driver_run using flang to update all reference outputs. With the fpcmp tolerance it does not matter which one is in the reference file, but I assume "2.5e-11" would be closer to what the result would be with infinite (/double?) precision and hence IMHO the one that we should put into the reference output.

Some general remarks about floating point:

  • Floating point formatters have different approached they output. One approach is to emit the decimal representation that is closest to the number represented by the IEEE bit pattern. Because this can lead to a lot of trailing digits in scientific notations for what should be "0,1" exactly, a more recent approach is to find the decimal representations with the fewest number of digits that would still be parsed back to the same IEEE bit pattern (as @rovka mentions, but the implementation doesn't seem to do that). So it is possible that 2.50000003e-11 is the result of the first approach, where 2.5e-11 follows the second approach (assuming single precision).

See above. f18's runtime will emit a minimal representation that reads back to the same bit representation (of the same real kind) when Fortran allows it to do so (viz., list-directed and NAMELIST output).

The binary-to-decimal conversion algorithm in flang/lib/Decimal is not the whole story. See the code in flang/runtime/edit-output.cpp to see how the Decimal library is driven for all of Fortran's various output editing modes.

  • fpcmp was not written with whether two string representations parse to the same IEEE bit pattern in mind. Only to account for the general fuzziness that comes with floating point processing, including effects through -ffast-math, transcendental functions that only compute approximations, during string formatting, etc.. Considering that single only has ~7.2 significant mantissa bits, a relative tolerance of 1.0e-11 effectively only allows exact matches.

IEC599/IEEE754 32-bit single precision has 24 effective fractional bits, enough to accurately distinguish 7 decimal digits.

  • I think the tests should indeed consider the differences in floating point formatting as valid, and if single precision is a valid implementation choice for a Fortran compiler, increasing the tolerance seems right to me. I'd even go as far to increase it for all tests since that kind of approximation could occur with the other tests as well.

The hard part is that you can't tell from the output whether the runtime was editing single-precision data (or half, double, extended, or quad precision), and if you relax tolerances to allow single precision to differ, you can miss errors in double. FPCMP probably can't detect output formatting errors of quad-precision data as it is.

See above. f18's runtime will emit a minimal representation that reads back to the same bit representation (of the same real kind) when Fortran allows it to do so (viz., list-directed and NAMELIST output).

The binary-to-decimal conversion algorithm in flang/lib/Decimal is not the whole story. See the code in flang/runtime/edit-output.cpp to see how the Decimal library is driven for all of Fortran's various output editing modes.

Thanks for this pointing out. I don't pretend to know how it works, but having also looked into the state of the art reference implementation Ryu, I expected it to require more code.

  • fpcmp was not written with whether two string representations parse to the same IEEE bit pattern in mind. Only to account for the general fuzziness that comes with floating point processing, including effects through -ffast-math, transcendental functions that only compute approximations, during string formatting, etc.. Considering that single only has ~7.2 significant mantissa bits, a relative tolerance of 1.0e-11 effectively only allows exact matches.

IEC599/IEEE754 32-bit single precision has 24 effective fractional bits, enough to accurately distinguish 7 decimal digits.

My mistake, I meant "~7.2 significant mantissa decimal digits" which matches your statement.

The hard part is that you can't tell from the output whether the runtime was editing single-precision data (or half, double, extended, or quad precision), and if you relax tolerances to allow single precision to differ, you can miss errors in double.

My impression was that the tests don't require a specific decision, hence should not expect results to conform to more than the minimum precision. If we want to test higher precisions, these could be separate tests.

FPCMP probably can't detect output formatting errors of quad-precision data as it is.

Correct. However, fpcmp was not intended for IEEE-conformance tests, but for "not obviously wrong" tests. It started when SPEC CPU was integrated and an equivalent for spec's specdiff was needed (which is a Perl script also using double precision).

See above. f18's runtime will emit a minimal representation that reads back to the same bit representation (of the same real kind) when Fortran allows it to do so (viz., list-directed and NAMELIST output).

The binary-to-decimal conversion algorithm in flang/lib/Decimal is not the whole story. See the code in flang/runtime/edit-output.cpp to see how the Decimal library is driven for all of Fortran's various output editing modes.

Thanks for this pointing out. I don't pretend to know how it works, but having also looked into the state of the art reference implementation Ryu, I expected it to require more code.

The requirements for Fortran's binary-to-decimal and decimal-to-binary conversions are unfortunately weird compared to what Ryu can do; we need rounding to arbitrary fixed precision, kP scaling, rounding modes beyond IEEE754's, and support for at least six distinct formats, including (for cross-compilation) some that may not have host ISA support. Hence the general conversion engine templates in lib/Decimal, which are used to do conversions in semantics and module file output as well as in the runtime library.

rovka updated this revision to Diff 441633.Jul 1 2022, 2:00 AM

Right, sorry, I don't know why the other change disappeared. It should be back now.
Regarding the other tests, AFAICT they don't use list-directed or NAMELIST output, so having stricter checks might (?) actually catch real problems. But really I don't know much about the subtleties of Fortran FP formats. @klausler Should we update the tolerance for all tests now or wait and treat each issue we might run into separately?

Right, sorry, I don't know why the other change disappeared. It should be back now.
Regarding the other tests, AFAICT they don't use list-directed or NAMELIST output, so having stricter checks might (?) actually catch real problems. But really I don't know much about the subtleties of Fortran FP formats. @klausler Should we update the tolerance for all tests now or wait and treat each issue we might run into separately?

You should capture the current f18 output and require exact matching, with no allowance for floating-point (or other) differences.

My conclusion is different. If gfortran's output is correct according to the Fortran standard, it should pass the test as well. Running the test-suite with other compilers than Clang/Flang was always helpful in testing the integrity of the tests themselves.

My conclusion is different. If gfortran's output is correct according to the Fortran standard, it should pass the test as well. Running the test-suite with other compilers than Clang/Flang was always helpful in testing the integrity of the tests themselves.

Standards often have implementation-dependent features; Fortran I/O is chock full of them, especially in the area of list-directed output. Standard-conforming implementations can produce quite distinct output from standard-conforming programs.

So you can relax tolerances for comparison and make this copy of FCVS more useful for testing multiple Fortran compilers, or you can install f18-specific output files and be more useful for catching inadvertent changes to output, but I don't think that you can do both.

rovka added a comment.Jul 26 2022, 3:28 AM

My conclusion is different. If gfortran's output is correct according to the Fortran standard, it should pass the test as well. Running the test-suite with other compilers than Clang/Flang was always helpful in testing the integrity of the tests themselves.

Standards often have implementation-dependent features; Fortran I/O is chock full of them, especially in the area of list-directed output. Standard-conforming implementations can produce quite distinct output from standard-conforming programs.

So you can relax tolerances for comparison and make this copy of FCVS more useful for testing multiple Fortran compilers, or you can install f18-specific output files and be more useful for catching inadvertent changes to output, but I don't think that you can do both.

Agreed, we can't do both. I think the spirit of the test-suite is to support testing multiple compilers. Inadvertent changes to output should be caught by unit tests. The test-suite is more meant to make sure we output something valid, and in some cases, like this one, there are lots of different things that can be considered valid.

My conclusion is different. If gfortran's output is correct according to the Fortran standard, it should pass the test as well. Running the test-suite with other compilers than Clang/Flang was always helpful in testing the integrity of the tests themselves.

Standards often have implementation-dependent features; Fortran I/O is chock full of them, especially in the area of list-directed output. Standard-conforming implementations can produce quite distinct output from standard-conforming programs.

So you can relax tolerances for comparison and make this copy of FCVS more useful for testing multiple Fortran compilers, or you can install f18-specific output files and be more useful for catching inadvertent changes to output, but I don't think that you can do both.

Agreed, we can't do both. I think the spirit of the test-suite is to support testing multiple compilers. Inadvertent changes to output should be caught by unit tests. The test-suite is more meant to make sure we output something valid, and in some cases, like this one, there are lots of different things that can be considered valid.

So if I want the LLVM source base to have a Fortran test suite that tests the one Fortran compiler in the LLVM source base, I'm out of luck and will need to set that up elsewhere?

My conclusion is different. If gfortran's output is correct according to the Fortran standard, it should pass the test as well. Running the test-suite with other compilers than Clang/Flang was always helpful in testing the integrity of the tests themselves.

Standards often have implementation-dependent features; Fortran I/O is chock full of them, especially in the area of list-directed output. Standard-conforming implementations can produce quite distinct output from standard-conforming programs.

So you can relax tolerances for comparison and make this copy of FCVS more useful for testing multiple Fortran compilers, or you can install f18-specific output files and be more useful for catching inadvertent changes to output, but I don't think that you can do both.

Agreed, we can't do both. I think the spirit of the test-suite is to support testing multiple compilers. Inadvertent changes to output should be caught by unit tests. The test-suite is more meant to make sure we output something valid, and in some cases, like this one, there are lots of different things that can be considered valid.

So if I want the LLVM source base to have a Fortran test suite that tests the one Fortran compiler in the LLVM source base, I'm out of luck and will need to set that up elsewhere?

More or less. Note that the llvm-test-suite lives in a separate repository. If you want something that's specifically testing flang and flang only, then that can indeed live in the LLVM source base (i.e. the monorepo).

My preference is that the reference output match the output of flang with the default settings. Would that work with the gfortran + the provided FP_TOLERANCE? The consensus was not to include fcvs in the llvm-project tree; we can revisit that decision if you'd like to start a thread in discourse or on of the weekly calls.

Regarding FP_TOLERANCE: We know that precision varies not only from compiler to compiler but also when using different command-line options. Again, my preference would be to have the committed files match flang with its default options. When testing other compilers, or other option sets, the appropriate value could be supplied through CMake.

rovka added a comment.Jul 27 2022, 3:07 AM

My preference is that the reference output match the output of flang with the default settings. Would that work with the gfortran + the provided FP_TOLERANCE? The consensus was not to include fcvs in the llvm-project tree; we can revisit that decision if you'd like to start a thread in discourse or on of the weekly calls.

Regarding FP_TOLERANCE: We know that precision varies not only from compiler to compiler but also when using different command-line options. Again, my preference would be to have the committed files match flang with its default options. When testing other compilers, or other option sets, the appropriate value could be supplied through CMake.

Ok, thanks for chiming in! I'm just going to update the reference outputs to match what flang does, and we can worry about gfortran later.

rovka updated this revision to Diff 452113.Aug 12 2022, 2:27 AM
rovka edited the summary of this revision. (Show Details)
rovka set the repository for this revision to rT test-suite.

Finally updated! Sorry it took so long, I was on vacation and then busy with the LLVM 15 release.

Thanks for continuing to work on this! Just to clarify:

it might be worth clarifying in the summary too.

This won't be true after https://reviews.llvm.org/D131754, right? (I mean the bit in bold)

Since fpcmp treats everything as double precision, it errors out in flang's case because flang is printing the minimum number of decimals that would produce the same single precision value

Could you replace "this" with "the output from gfortran":

Currently, the reference output contains the former, and the relative tolerance is low enough that this is being reported as an error.

Sorry for being picky, but this is quite fiddly and easy to get wrong.

Fortran/UnitTests/fcvs21_f95/CMakeLists.txt
43

I'd update this comment to clarify that this was introduced for "gfortran" specifically.

rovka added a comment.Aug 15 2022, 4:50 AM

Thanks for continuing to work on this! Just to clarify:

Yes.

  • Is this fix specifically for "gfortran" (i.e. required for "FM905.f" to pass when using "gfortran")?

Yes, this is for gfortran now (I haven't tested other compilers). It also helps with flang if we're passing -fdefault-real-8, although that's not a scenario I'm planning to put effort into supporting right now.

it might be worth clarifying in the summary too.

This won't be true after https://reviews.llvm.org/D131754, right? (I mean the bit in bold)

Right, sorry! I thought I had updated everything...

Since fpcmp treats everything as double precision, it errors out in flang's case because flang is printing the minimum number of decimals that would produce the same single precision value

Could you replace "this" with "the output from gfortran":

Currently, the reference output contains the former, and the relative tolerance is low enough that this is being reported as an error.

Sorry for being picky, but this is quite fiddly and easy to get wrong.

No worries, this is good feedback, I'll update!

rovka edited the summary of this revision. (Show Details)Aug 15 2022, 5:09 AM
rovka updated this revision to Diff 452636.Aug 15 2022, 5:23 AM
rovka edited the summary of this revision. (Show Details)
awarzynski accepted this revision.Aug 22 2022, 4:06 AM

Thanks for addressing my comments and for seeing this through!

This revision is now accepted and ready to land.Aug 22 2022, 4:06 AM
This revision was landed with ongoing or failed builds.Aug 23 2022, 12:40 AM
This revision was automatically updated to reflect the committed changes.