
[test-suite] Run FP tests twice with -ffp-contract=off/on
AbandonedPublic

Authored by sebpop on Oct 5 2016, 6:26 AM.

Details

Summary

With this patch all the tests specifying an FP_TOLERANCE or FP_ABSTOLERANCE will be run in two stages:

  • the first step runs with the extra flags "-ffp-contract=off -fno-fast-math" and records the full output that the second step will be checked against
  • the second step runs with the default flags specified by the user and checks its output against the output recorded in the first step, within the FP_TOLERANCE specified by the test.
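The two-stage comparison above can be sketched in a few lines. This is a simplified stand-in for what fpcmp does (the real tool also handles absolute tolerance and non-numeric text in the output), not its exact algorithm:

```python
# Sketch of the two-stage scheme (simplified; real fpcmp also
# handles absolute tolerance and mixed text/number output).

def stage2_matches(reference, candidate, fp_tolerance):
    # reference: values recorded by the -ffp-contract=off
    # -fno-fast-math run; candidate: values from the default-flags
    # run. Each pair must agree within the relative FP_TOLERANCE.
    for ref, val in zip(reference, candidate):
        denom = max(abs(ref), abs(val), 1e-300)
        if abs(ref - val) / denom > fp_tolerance:
            return False
    return True

ref_out = [1.0, 2.5, 3.141592653589793]      # stage 1 output
fma_out = [1.0, 2.5000000001, 3.1415926536]  # stage 2 (FMA enabled)
assert stage2_matches(ref_out, fma_out, fp_tolerance=1e-8)
```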

There are at least two main problems with this approach:

  • some of the current golden files in the test-suite require testing with FP_TOLERANCE even for stage1
  • the build directory grows from 300M to 1.2G due to the extra reference outputs recorded by stage 1, and for at least 2 tests (polybench/bicg and polybench/mvt, disabled in the current patch) fpcmp takes more than 10 minutes to fully parse the outputs of stages 1 and 2. Running the test-suite on small devices will then cost about 1G of extra transfer over the network.

The current implementation has another, smaller technical issue:
lit does not allow specifying dependencies between tests.
When running "lit -j8" one can end up running stage 2 before stage 1.
The solution I was considering is to run step 1 as a PREPARE: step,
though that step would not append the exit code to the output,
so the recorded output would mismatch the hash.
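The hash mismatch can be illustrated in a few lines; the exact "exit <code>" convention is an assumption here, not quoted from the test-suite harness:

```python
import hashlib

def output_hash(program_output, exit_code=None):
    data = program_output
    if exit_code is not None:
        # The harness appends the exit status to the captured output
        # before hashing; a step that skips this produces a different
        # digest even for byte-identical program output.
        data += f"exit {exit_code}\n"
    return hashlib.md5(data.encode()).hexdigest()

assert output_hash("42\n", exit_code=0) != output_hash("42\n")
```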

Diff Detail

Event Timeline

sebpop updated this revision to Diff 73639.Oct 5 2016, 6:26 AM
sebpop retitled this revision from to [test-suite] Run FP tests twice with -ffp-contract=off/on.
sebpop updated this object.
sebpop added reviewers: rengolin, cmatthews, MatzeB, Abe.
sebpop added a subscriber: llvm-commits.
spatel added a subscriber: spatel.Oct 5 2016, 6:43 AM
MatzeB edited edge metadata.Oct 5 2016, 11:23 AM
MatzeB added a subscriber: scanon.

Ugh, I don't like the added complexity and disk space bloat.

Is there no way to summarize/hash/compress a list of floating-point numbers in a way that we can compare them against a reference output and at the same time allow some fluctuations in floating-point accuracy?

sebpop added a comment.Oct 5 2016, 1:17 PM

Ugh, I don't like the added complexity and disk space bloat.

Is there no way to summarize/hash/compress a list of floating-point numbers in a way that we can compare them against a reference output and at the same time allow some fluctuations in floating-point accuracy?

If those tests had been written with FP_TOLERANCE testing in mind, we would not have all these problems.
In the case of Polybench in general, and of bicg and mvt specifically, the tests dump 4000 by 4000 double matrices, which is ridiculous; all of Polybench should be restricted to comparing against an exact reference hash. These tests (and maybe others) should then only run under -ffp-contract=off.
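A hashed-output scheme for such tests could look roughly like this (toy sizes and formatting; Polybench's real dump format differs). Since the hash comparison is exact, such a test would indeed have to run under -ffp-contract=off only:

```python
import hashlib

def matrix_digest(rows):
    # Emit nothing but a digest of the formatted dump, so the golden
    # file stores one short line instead of a 4000x4000 matrix.
    text = "".join(" ".join(f"{v:.6f}" for v in row) + "\n"
                   for row in rows)
    return hashlib.md5(text.encode()).hexdigest()

rows = [[float((i * j) % 7) for j in range(4)] for i in range(4)]
print(matrix_digest(rows))  # the only line to compare exactly
```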

rengolin edited edge metadata.Oct 6 2016, 8:29 AM

With this patch all the tests specifying an FP_TOLERANCE or FP_ABSTOLERANCE will be run in two stages:

the first step runs with the extra flags "-ffp-contract=off -fno-fast-math" and records the full output that the second step will be checked against
the second step runs with the default flags specified by the user and checks its output against the output recorded in the first step, within the FP_TOLERANCE specified by the test.

This doesn't make sense. If there's an FP bug, it can manifest on both sides, meaning we'd only be catching contract problems, not FP problems.

The proposed way was to have the exact same reference output and compare with a higher FP tolerance when running with fp-contract=on.

There are at least two main problems with this approach:

some of the current golden files in the test-suite require testing with FP_TOLERANCE even for stage1

Of course, as expected.

the build directory grows from 300M to 1.2G due to the extra reference outputs recorded by stage 1, and for at least 2 tests (polybench/bicg and polybench/mvt, disabled in the current patch) fpcmp takes more than 10 minutes to fully parse the outputs of stages 1 and 2. Running the test-suite on small devices will then cost about 1G of extra transfer over the network.

This is not acceptable.

The current implementation has another smaller technical issue:
lit does not allow specifying dependencies between tests.
When running "lit -j8" one can end up running stage 2 before stage 1.

Ouch, that's horrible!

The solution I was considering is to run step 1 as a PREPARE: step,
though that step would not append the exit code to the output,
so the recorded output would mismatch the hash.

That's way too complex. Why not have two different tolerances for each run?

hfinkel added a subscriber: hfinkel.Oct 6 2016, 8:43 AM

With this patch all the tests specifying an FP_TOLERANCE or FP_ABSTOLERANCE will be run in two stages:

the first step runs with the extra flags "-ffp-contract=off -fno-fast-math" and records the full output that the second step will be checked against
the second step runs with the default flags specified by the user and checks its output against the output recorded in the first step, within the FP_TOLERANCE specified by the test.

This doesn't make sense. If there's an FP bug, it can manifest on both sides, meaning we'd only be catching contract problems, not FP problems.

The proposed way was to have the exact same reference output and compare with a higher FP tolerance when running with fp-contract=on.

There are at least two main problems with this approach:

some of the current golden files in the test-suite require testing with FP_TOLERANCE even for stage1

Of course, as expected.

the build directory grows from 300M to 1.2G due to the extra reference outputs recorded by stage 1, and for at least 2 tests (polybench/bicg and polybench/mvt, disabled in the current patch) fpcmp takes more than 10 minutes to fully parse the outputs of stages 1 and 2. Running the test-suite on small devices will then cost about 1G of extra transfer over the network.

This is not acceptable.

We should look at making polybench run on smaller data sets for these kinds of comparisons. As I recall, we're running into this problem because the current polybench setups are designed to run in a regime where the tests are memory-bandwidth bound. This is where polyhedral transformations are often profitable, which is good, but it is not as useful for this kind of regression testing.

We can have large polybench tests; this is useful for performance testing of higher-order transformations. They should, however, hash their own output so that we have an easy comparison. We can do full dumps on smaller-sized runs. As I've suggested previously, we should save the output of the IEEE-strict run, compare it to the hashed output, and then compile/run in the default mode and compare that to the IEEE-strict output with a higher tolerance.
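The suggested flow (hash-check the IEEE-strict run, then compare the default-mode run against it with a looser tolerance) might be sketched as follows; the names and the golden-hash format are illustrative, not the test-suite's actual scheme:

```python
import hashlib

def check_benchmark(strict_values, default_values, golden_hash, tol):
    # 1. The IEEE-strict run must match the stored hash exactly.
    strict_text = " ".join(f"{v:.12g}" for v in strict_values)
    if hashlib.md5(strict_text.encode()).hexdigest() != golden_hash:
        return False  # the strict run itself regressed
    # 2. The default-mode run only has to stay within the tolerance.
    return all(abs(s - d) <= tol * max(abs(s), 1.0)
               for s, d in zip(strict_values, default_values))

strict = [1.25, 2.5]
golden = hashlib.md5(" ".join(f"{v:.12g}" for v in strict)
                     .encode()).hexdigest()
assert check_benchmark(strict, [1.25, 2.5000001], golden, 1e-6)
```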

The current implementation has another smaller technical issue:
lit does not allow specifying dependencies between tests.
When running "lit -j8" one can end up running stage 2 before stage 1.

Ouch, that's horrible!

I think that running all of the configurations of a benchmark should happen as one "test" in the lit sense, i.e. the generated test file should have multiple RUN lines.
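A generated test with multiple RUN lines might look roughly like this (the substitutions and fpcmp flags are illustrative, not the test-suite's actual generated form):

```
; RUN: %clang %cflags -ffp-contract=off -fno-fast-math %s -o %t.strict
; RUN: %t.strict > %t.ref
; RUN: %clang %cflags %s -o %t.default
; RUN: %t.default > %t.out
; RUN: fpcmp -r %fp_tolerance %t.ref %t.out
```

Because lit executes the RUN lines of a single test in order, the strict run is guaranteed to finish before the default run is compared against it.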

The solution I was considering is to run step 1 as a PREPARE: step,
though that step would not append the exit code to the output,
so the recorded output would mismatch the hash.

That's way too complex. Why not have two different tolerances for each run?

No, we can have multiple RUN lines.

sebpop abandoned this revision.Oct 19 2016, 6:48 AM