This is an archive of the discontinued LLVM Phabricator instance.

[testsuite] turn fp-contract off because we can check those results more precisely
Abandoned (Public)

Authored by spatel on Sep 26 2016, 9:48 AM.

Details

Summary

Motivation: we want to reinstate D24481 / rL282259 without breaking test-suite bots, so let's generalize the check that PowerPC is currently using to overcome the too-strict reference output checks. Ideally, we would do something like rL282027 to customize the checking per test, but let's not gate FMA codegen on that because it requires more investigation to get right.

From the earlier failures, we know that future improvements in this area will require changes for output-checking for at least these tests when running in a mode where fp-contraction is allowed:
FAIL: MultiSource/Applications/oggenc/oggenc.execution_time (490 of 2445)
FAIL: MultiSource/Benchmarks/MiBench/telecomm-FFT/telecomm-fft.execution_time (491 of 2445)
FAIL: MultiSource/Benchmarks/VersaBench/beamformer/beamformer.execution_time (492 of 2445)
FAIL: SingleSource/Benchmarks/Linpack/linpack-pc.execution_time (493 of 2445)
FAIL: SingleSource/Benchmarks/Misc-C++/Large/sphereflake.execution_time (494 of 2445)
FAIL: SingleSource/Benchmarks/Polybench/datamining/correlation/correlation.execution_time (495 of 2445)
FAIL: SingleSource/Benchmarks/Polybench/datamining/covariance/covariance.execution_time (496 of 2445)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm.execution_time (497 of 2445)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm.execution_time (498 of 2445)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/atax/atax.execution_time (499 of 2445)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/bicg/bicg.execution_time (500 of 2445)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/cholesky/cholesky.execution_time (501 of 2445)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemver/gemver.execution_time (502 of 2445)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gesummv/gesummv.execution_time (503 of 2445)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/mvt/mvt.execution_time (504 of 2445)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.execution_time (505 of 2445)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trisolv/trisolv.execution_time (506 of 2445)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trmm/trmm.execution_time (507 of 2445)
FAIL: SingleSource/Benchmarks/Polybench/linear-algebra/solvers/gramschmidt/gramschmidt.execution_time (508 of 2445)
FAIL: SingleSource/Benchmarks/Polybench/stencils/adi/adi.execution_time (509 of 2445)
FAIL: SingleSource/Benchmarks/Polybench/stencils/fdtd-2d/fdtd-2d.execution_time (510 of 2445)

Diff Detail

Event Timeline

spatel updated this revision to Diff 72506. Sep 26 2016, 9:48 AM
spatel retitled this revision to [testsuite] turn fp-contract off for ARM because output checking is not flexible enough.
spatel updated this object.
spatel added reviewers: Abe, rengolin, hfinkel, scanon, spop, yaxunl, MatzeB.
spatel added a subscriber: llvm-commits.
hfinkel edited edge metadata. Sep 26 2016, 9:54 AM

I think that you'll need the corresponding change in the CMake file too.

rengolin requested changes to this revision. Sep 26 2016, 10:00 AM
rengolin edited edge metadata.

I'm not sure what the point of this patch is. Can you describe what's broken and why this fixes it?

If this is related to the other patch that was reverted, then, as I said there, this needs a wider discussion, and I'd be very surprised if we couldn't fix whatever the problems are on ARM instead of just completely disabling it and making ARM a different path from other architectures.

I really don't want us to start segregating the implementations per architecture, especially when nothing is broken yet.

cheers,
--renato

This revision now requires changes to proceed. Sep 26 2016, 10:00 AM
scanon edited edge metadata. Sep 26 2016, 10:56 AM

The tests in question are broken. They blindly apply a relative error tolerance that is not motivated by sound numerical analysis, despite the existence of well-known, well-founded error bounds for these problems in every basic numerical analysis textbook.

That said, this doesn't actually fix the issue. It's a band-aid that allows the other patch to proceed. Since you seem to want to block that change in the short-term, there's probably not much point in proceeding with this one.

If someone has the time to devote to a correct fix for these tests, I can help them with the mathematically-correct error bounds.

I'm not sure what the point of this patch is. Can you describe what's broken and why this fixes it?

If this is related to the other patch that was reverted, then, as I said there, this needs a wider discussion, and I'd be very surprised if we couldn't fix whatever the problems are on ARM instead of just completely disabling it and making ARM a different path from other architectures.

I really don't want us to start segregating the implementations per architecture, especially when nothing is broken yet.

I don't know what "segregating the implementations per architecture" means, and this patch does not disable any testing (just makes it test differently). Let me try to explain some more:

  1. rL282259 (like D14200 / rL253269 before it) proposed to make -ffp-contract=on the default.
  2. This allows FMA codegen for targets that have FMA instructions.
  3. Please don't confuse this with -ffast-math; this has nothing to do with -ffast-math.
  4. Obviously AArch64 has FMA instructions because r282259 caused AArch64 bots that run test-suite to fail and so it was reverted: rL282289.
  5. The output checking for some tests that use FP in test-suite appears to be looking for exact bit patterns and/or does not have an appropriate FP tolerance setting.
  6. This output checking is broken when we use FMA instructions because FMA instructions can produce different results than FMUL + FADD.
  7. To keep the bots green, I am proposing (really just re-proposing the suggestions of Sept 23 from the post-commit thread for r282259) that they run with -ffp-contract=off in order to match the (faulty) expectations contained in test-suite.
  8. I'm actually surprised that other architectures would not also fail test-suite for the same reason. Maybe it's just that there are no other FMA-capable bots running test-suite?
  9. It is not wrong for bots to run with a particular -ffp-contract setting; they are all perfectly valid modes of operation after all.
  10. Ideally, we would have more configs/bots to check each fp-contract mode. If we did that, then we'd also need to update the test-suite output checks to account for those different modes, but AFAIK nobody has volunteered to do that work.
  11. Another solution would be to loosen the checks (increase the FP tolerance) in test-suite for tests that are sensitive to FMA codegen, so they will pass independently of the -ffp-contract setting.
  12. This patch does not hinder anyone from doing the work for either #10 / #11.

I don't know what "segregating the implementations per architecture" means, and this patch does not disable any testing (just makes it test differently). Let me try to explain some more:

This is exactly what it means: testing different things for different architectures. If it used to work, and it's not working, disabling / redirecting the tests is *not* a solution.

  1. rL282259 (like D14200 / rL253269 before it) proposed to make -ffp-contract=on the default.
  2. This allows FMA codegen for targets that have FMA instructions.
  3. Please don't confuse this with -ffast-math; this has nothing to do with -ffast-math.
  4. Obviously AArch64 has FMA instructions because r282259 caused AArch64 bots that run test-suite to fail and so it was reverted: rL282289.

So do ARM and many other architectures.

  5. The output checking for some tests that use FP in test-suite appears to be looking for exact bit patterns and/or does not have an appropriate FP tolerance setting.
  6. This output checking is broken when we use FMA instructions because FMA instructions can produce different results than FMUL + FADD.

As expected.

  7. To keep the bots green, I am proposing (really just re-proposing the suggestions of Sept 23 from the post-commit thread for r282259) that they run with -ffp-contract=off in order to match the (faulty) expectations contained in test-suite.

And I'm strongly against any decision that disables features / testing patterns when trying to understand and fix the test / code is a viable solution. If everything was broken, I'd accept pushing hack patches in. This is not the case.

  8. I'm actually surprised that other architectures would not also fail test-suite for the same reason. Maybe it's just that there are no other FMA-capable bots running test-suite?

That is most likely the case. Or they just happen not to break in that particular situation. Relying on unknown behaviour is dangerous.

  9. It is not wrong for bots to run with a particular -ffp-contract setting; they are all perfectly valid modes of operation after all.

No, but this is a BOT option, not a TEST option. Unless there is *no* way to fix the test (which I haven't heard anything about yet), this could be one way of doing it.

But this is orthogonal to the decision itself.

  10. Ideally, we would have more configs/bots to check each fp-contract mode. If we did that, then we'd also need to update the test-suite output checks to account for those different modes, but AFAIK nobody has volunteered to do that work.
  11. Another solution would be to loosen the checks (increase the FP tolerance) in test-suite for tests that are sensitive to FMA codegen, so they will pass independently of the -ffp-contract setting.

Loosening the tests by increasing the tolerance would only be a viable alternative if we can prove that it is indeed irrelevant to the test at hand. Steve says it is, but he also says someone with time could fix the test.

Since nothing is broken and this is an optimisation, I think we do have the time to fix it.

  12. This patch does not hinder anyone from doing the work for either #10 / #11.

It relaxes tests when we should be fixing tests. The first step of making tests irrelevant is disabling them for random reasons instead of fixing them. If that's a route we're willing to take, then we might just as well disable all buildbots.

MatzeB edited edge metadata. Sep 26 2016, 11:40 AM

Two comments:

  1. If these fp-contract changes are enough to break various test-suite tests, then we will see the same effects for other compiler users. So please put some explanation (and ideally also some way to deal with such issues) in the release notes!
  2. Is there a list of broken benchmarks somewhere? I don't see anyone doing numerical analysis on the test-suite benchmarks; what you can do today, though, is tell the test-suite to use fpcmp for comparing the results and set some absolute or relative tolerances.

I don't know what "segregating the implementations per architecture" means, and this patch does not disable any testing (just makes it test differently). Let me try to explain some more:

This is exactly what it means: testing different things for different architectures. If it used to work, and it's not working, disabling / redirecting the tests is *not* a solution.

Hold on... both the new and the old state are working states; they're just different. When you use FMAs, as the standard allows, the answer will often be different compared to not using them. However, making this change is actually the most-useful thing to do (I'll explain why I think this below).

  12. This patch does not hinder anyone from doing the work for either #10 / #11.

It relaxes tests when we should be fixing tests. The first step of making tests irrelevant is disabling them for random reasons instead of fixing them. If that's a route we're willing to take, then we might just as well disable all buildbots.

I understand you're being sarcastic, but this is unfair. This is not a random reason; we understand exactly why this change produces changes in program output, and I also believe that testing using -ffp-contract=off is the most-useful test configuration to have (although having others in addition would be great too). Why is it the most useful? Because it is the testing configuration for which we can have the most-precise known-good answer for comparison. As you point out, having strict thresholds for these tests has caught important miscompilation bugs in the past. When you start using FMAs, one issue is that there are lots of places where you now might or might not actually use them, and so the space of potential correct answers is large. We can come up with looser tolerances to capture this freedom, but even doing this precisely is potentially degenerate with bugs for some inputs, and so only having the looser version is actually worse than just testing with FMAs disabled. I would love to have both, but in the meantime, this is the right thing to do.

FWIW, the reason we do this on PPC/Linux, and have for a long time, is exactly because GCC on PPC/Linux defaults to using FMAs, and so GCC needs -ffp-contract=off to pass our test suite (independent of anything else). This is just a general statement about our reference outputs: they're all generated without FMAs. Our flags should encode that dependency so that we don't see spurious failures (with Clang or any other compiler).
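A sketch of what "encoding that dependency" could look like in the test-suite's CMake flags. This is a hypothetical fragment; the variable names are illustrative and not necessarily the ones the test-suite actually uses:

```cmake
# Reference outputs were generated without FMAs, so compile the
# programs being compared against them with contraction disabled,
# for every compiler and target.
list(APPEND CFLAGS   -ffp-contract=off)
list(APPEND CXXFLAGS -ffp-contract=off)
```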

  1. Is there a list of broken benchmarks somewhere? I don't see anyone doing numerical analysis on the test-suite benchmarks, what you can today though is telling the test-suite to use fpcmp for comparing the results and set some absolute or relative tolerances.

A first pass at broken benchmarks is anything that failed on the AArch64 bots with the fp-contract patch:

(the same list of 21 failing tests given in the Summary above)

Hold on... both the new and the old state are working states; they're just different.

No. In the extreme case, "(return 0) == 0" is a way to make the test "pass". Of course this is not the case here, but changing tests to accommodate code changes always involves some degree of relaxation. We just want to make sure the relaxation is meaningful.

When you use FMAs, as the standard allows, the answer will often be different compared to not using them. However, making this change is actually the most-useful thing to do (I'll explain why I think this below).

Absolutely. But ARM and AArch64 have FMA, and if that change is beneficial to other architectures, and this is the default in Clang, then *NOT* testing it means real code may break and we won't notice.

Believe me, I've been there before and have suffered the consequences.

I understand you're being sarcastic, but this is unfair.

I wasn't being sarcastic; it was meant as hyperbole.

This is not a random reason, we understand exactly why this change produces changes in program output, and I also believe that testing using -ffp-contract=off is the most-useful test configuration to have (although having others in addition would be great too).

But it's not testing what the user will have, and that's a recipe for disaster.

Why is it the most useful? Because it is the testing configuration for which we can have the most-precise known-good answer for comparison. As you point out, having strict thresholds for these tests has caught important miscompilation bugs in the past. When you start using FMAs, one issue is that there are lots of places where you now might or might not actually use them, and so the space of potential correct answers is large. We can come up with looser tolerances to capture this freedom, but even doing this precisely is potentially degenerate with bugs for some inputs, and so only having the looser version is actually worse than just testing with FMAs disabled. I would love to have both, but in the meantime, this is the right thing to do.

So, as Steve said, FMA is generally *more* precise, even if not standard C. But this is not about precision, it's about behaviour. I have to test what users will use, not what is more stable or less controversial.

We need to understand what the breaking bots were all about. It's very likely that they're all due to the same underlying problem, so fixing one thing on all tests might make all this argument moot.

In the past, I have fixed problems like this by implementing algorithms by hand (like sin/cos) to make sure the behaviour is known and doesn't heavily rely on system/libc choices. We'll probably have to do the same here, and I'm surprised that no one is trying to understand the breakage, only working around it.

FWIW, the reason we do this on PPC/Linux, and have for a long time, is exactly because GCC on PPC/Linux defaults to using FMAs, and so GCC needs -ffp-contract=off to pass our test suite (independent of anything else). This is just a general statement about our reference outputs: they're all generated without FMAs. Our flags should encode that dependency so that we don't see spurious failures (with Clang or any other compiler).

If the behaviour defaults to ON, then I *have* to test it ON. That's a simple truth of validation that we should not budge.

spatel updated this revision to Diff 72560. Sep 26 2016, 2:12 PM
spatel retitled this revision from [testsuite] turn fp-contract off for ARM because output checking is not flexible enough to [testsuite] turn fp-contract off because we can check those results more precisely.
spatel updated this object.
spatel edited edge metadata.

Patch/title/summary updated:
After re-reading Hal and Steve's Sept 23 replies and in light of today's comments, I realize that I unnecessarily made the change ARM-specific when the arguments really apply to *all* targets. Sorry about that.

Also, I still don't know how to check this locally, and it's not clear to me how to make the change apply to a CMake build. If this patch has any hope of going forward, feel free to put me out of my misery and commandeer it! :)

rengolin requested changes to this revision. Oct 4 2016, 3:16 AM
rengolin edited edge metadata.

Rejecting it for the same reasons as expressed in D25194 and the other similar threads.

"Ideally, we would run everything again with -ffp-contract=on and fix any output-checking that is not prepared for that possibility."

That's what we realistically need to do, not ideally.

cheers,
--renato

This revision now requires changes to proceed. Oct 4 2016, 3:16 AM
spatel abandoned this revision. Oct 4 2016, 7:38 AM

Abandoning. D25194 is a more precise implementation of this idea.