This is an archive of the discontinued LLVM Phabricator instance.

[test-suite] [Polybench] run tests twice with -ffp-contract=on/off
ClosedPublic

Authored by sebpop on Oct 6 2016, 1:58 PM.

Details

Summary

This patch implements the idea that I posted as a reply to Matthias on D25277:
On Wed, Oct 5, 2016 at 5:33 PM, Matthias Braun <matze@braunis.de> wrote:

I was mainly wondering here whether there may be a sensible generic mechanism to combine a list of floatingpoint numbers

For all the tests in the polybench, I think we can do bisimulation.
What I mean by bisimulation is that we would copy the kernel()
function of each polybench/test.c and name it kernelNoFP() and add
flag attributes "-fno-fast-math -ffp-contract=off" (can be split
compilation if flag attributes do not work.)
main() will call kernel() and kernelNoFP() and compare their output
with FP_TOLERANCE.
Only the execution of kernel() will be timed for benchmark performance result.
main() will only print the output from kernelNoFP() that will be
hashed and compared against the reference hash (as we currently expect
exact match of the output hash.)

The good things:

  • no modifications to CMake and Makefiles
  • no extra space to store the extra reference output
  • tests both user CFLAGS specified mode and fast-math and fp-contraction=off.

The bad things: (because of the extra reference run of kernelNoFP())

  • compilation time will double: e.g., Polly will optimize both kernels,
  • memory requirements on the device will almost double: added one

extra output array, input arrays are not modified, so no need to
duplicate them,

  • compute time on the device will more than double: running the kernel

twice, plus an extra loop over both outputs to compare with
FP_TOLERANCE.

Diff Detail

Repository
rL LLVM

Event Timeline

sebpop updated this revision to Diff 73849.Oct 6 2016, 1:58 PM
sebpop retitled this revision from to [test-suite] [Polybench] run tests twice with -ffp-contract=on/off.
sebpop updated this object.
sebpop added reviewers: grosser, rengolin, MatzeB.
sebpop added subscribers: llvm-commits, Abe.
rengolin requested changes to this revision.Oct 7 2016, 2:19 AM
rengolin edited edge metadata.
rengolin added inline comments.
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm.c
116 ↗(On Diff #73849)

sigh... We've been over this...

This revision now requires changes to proceed.Oct 7 2016, 2:19 AM
sebpop added inline comments.Oct 7 2016, 2:23 AM
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm.c
116 ↗(On Diff #73849)

What do you mean we've been over this? Please be more precise in your review.

I hope you do understand that this pragma only applies to the current function.
The test is thus testing both modes: kernel_2mm() tests with fp-contract=on and kernel_2mm_StrictFP tests with fp-contract=off.

What I mean by bisimulation is that we would copy the kernel()
function of each polybench/test.c and name it kernelNoFP() and add
flag attributes "-fno-fast-math -ffp-contract=off" (can be split
compilation if flag attributes do not work.)
main() will call kernel() and kernelNoFP() and compare their output
with FP_TOLERANCE.
Only the execution of kernel() will be timed for benchmark performance result.
main() will only print the output from kernelNoFP() that will be
hashed and compared against the reference hash (as we currently expect
exact match of the output hash.)

So, the end result is that we'll *test* what we don't ship and we'll *benchmark* what we don't test.

Meaning we'll have green bots and fast benchmarks, but we'll have no idea if the "fast" is correct of if the correct is fast.

--renato

rengolin added inline comments.Oct 7 2016, 2:29 AM
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm.c
30 ↗(On Diff #73849)

clang-format

114 ↗(On Diff #73849)

clang format

132 ↗(On Diff #73849)

clang format

181 ↗(On Diff #73849)

clang format

205 ↗(On Diff #73849)

clang-format

rengolin added inline comments.Oct 7 2016, 2:33 AM
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm.c
140 ↗(On Diff #73849)

why do you need the pragma here, too?

146 ↗(On Diff #73849)

If this has to be defined at compilation time, do we need to re-refine here? Maybe it's better to get a compilation error, to make sure the users know what they're getting.

208 ↗(On Diff #73849)

Why not do this to all arrays, not just D?

sebpop added a comment.Oct 7 2016, 2:34 AM

What I mean by bisimulation is that we would copy the kernel()
function of each polybench/test.c and name it kernelNoFP() and add
flag attributes "-fno-fast-math -ffp-contract=off" (can be split
compilation if flag attributes do not work.)
main() will call kernel() and kernelNoFP() and compare their output
with FP_TOLERANCE.
Only the execution of kernel() will be timed for benchmark performance result.
main() will only print the output from kernelNoFP() that will be
hashed and compared against the reference hash (as we currently expect
exact match of the output hash.)

So, the end result is that we'll *test* what we don't ship and we'll *benchmark* what we don't test.

Meaning we'll have green bots and fast benchmarks, but we'll have no idea if the "fast" is correct of if the correct is fast.

This statement is false.

kernel() is compiled with whatever Cflags are specified by the end user of the test-suite.
The output of kernel() is matched with FP_TOLERANCE against the output of kernel_Strict_FP().
The output of kernel_Strict_FP() is hashed and compared against reference output.

kernel() is timed, as this will be what the end-user of the test-suite asked for "-O0", or "-Ofast -polly", or any other combination.

sebpop added a comment.Oct 7 2016, 2:35 AM

clang-format

No. Not in the test-suite. We do not own that code, and I will not run clang-format on all the tests in the test-suite.

sebpop added inline comments.Oct 7 2016, 2:40 AM
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm.c
140 ↗(On Diff #73849)

Maybe adding the pragma here is "overkill", and would have no effects.
Thanks for pointing out.

146 ↗(On Diff #73849)

This is a quick "get it done" define. Of course, I will modify the patch to add -DFP_ABSTOLERANCE from the CMake/Makefile to the C compilation, as the previous comment suggests.

208 ↗(On Diff #73849)

D is the only output of this test.
The other arrays are input, and input does not change.

clang-format

No. Not in the test-suite. We do not own that code, and I will not run clang-format on all the tests in the test-suite.

Not in the whole test, just in your changes. You're probably using TABS or something, and the formatting is coming wrong. Please look at the result in this page and you'll see.

sebpop added a comment.Oct 7 2016, 2:46 AM

clang-format

No. Not in the test-suite. We do not own that code, and I will not run clang-format on all the tests in the test-suite.

Not in the whole test, just in your changes. You're probably using TABS or something, and the formatting is coming wrong. Please look at the result in this page and you'll see.

I will run clang-format-patch on the changes in the patch.

rengolin added inline comments.Oct 7 2016, 2:49 AM
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm.c
146 ↗(On Diff #73849)

I see. Are you adding the Makefile changes to this patch later?

208 ↗(On Diff #73849)

what about the rest of the output?

sebpop added inline comments.Oct 7 2016, 2:55 AM
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm.c
146 ↗(On Diff #73849)

Yes, if we agree that this is the way to go forward, I will finish up the patch adding for each polybench this kind of change, and the Makefiles and CMake will be amended to add -DFP_{ABS}TOLERANCE to CFLAGS.

208 ↗(On Diff #73849)

Please help me here: which other output?

See the comment in the lines just below:

All live-out data must be printed
by the function call in argument. */

D is the only output of this benchmark (to my knowledge.)

kernel() is compiled with whatever Cflags are specified by the end user of the test-suite.
The output of kernel() is matched with FP_TOLERANCE against the output of kernel_Strict_FP().
The output of kernel_Strict_FP() is hashed and compared against reference output.

Right, now I get it. So, you basically just duplicated the kernel to force FP=OFF on the second. Then you compare against the *real* timed run for FP_TOLERANCE.

This should work as intended, I imagine, but would that scale to *all* problematic benchmarks? If there are 50 of them, duplicating all the cores may bring more problems than solving...

sebpop added a comment.Oct 7 2016, 3:32 AM

kernel() is compiled with whatever Cflags are specified by the end user of the test-suite.
The output of kernel() is matched with FP_TOLERANCE against the output of kernel_Strict_FP().
The output of kernel_Strict_FP() is hashed and compared against reference output.

Right, now I get it. So, you basically just duplicated the kernel to force FP=OFF on the second. Then you compare against the *real* timed run for FP_TOLERANCE.

Correct.

This should work as intended, I imagine, but would that scale to *all* problematic benchmarks? If there are 50 of them, duplicating all the cores may bring more problems than solving...

As I was mentioning, this is intended only for the Polybench, that are following a pretty regular pattern of testing loop kernels.
The patch makes 2mm run correctly for CFLAGS="", "-Ofast", and "-O3 -ffp-contract=on".
I have not looked at the other failing benchmarks to say whether this can apply to those.
I know that oggenc would be impossible to modify like this.

There are also these resource problems I was mentioning:

  • compilation time will double: e.g., Polly will optimize both kernels,
  • memory requirements on the device will almost double: added one extra output array, input arrays are not modified, so no need to duplicate them,
  • compute time on the device will more than double: running the kernel twice, plus an extra loop over both outputs to compare with FP_TOLERANCE.

Not perfect, though it will give us some extra FP tests that we can run at -Ofast and other flags that need FP_TOLERANCE testing.

As I was mentioning, this is intended only for the Polybench, that are following a pretty regular pattern of testing loop kernels.
The patch makes 2mm run correctly for CFLAGS="", "-Ofast", and "-O3 -ffp-contract=on".
I have not looked at the other failing benchmarks to say whether this can apply to those.
I know that oggenc would be impossible to modify like this.

Right. Abe was trying to change the Make/CMake files, which could be a one-size-fits-all solution. If that works, I think we should go with that.

There are also these resource problems I was mentioning:

  • compilation time will double: e.g., Polly will optimize both kernels,
  • compute time on the device will more than double: running the kernel twice, plus an extra loop over both outputs to compare with FP_TOLERANCE.

Those two are true for all solutions we can come up with. 2mm takes 100s on ARM and 3mm takes 150s. Doubling ~50 could make it several minutes longer.

You were proposing getting something running first, then change later, but with a tolerance of 0.0001 after all aggregations, I think we could have FP=on and that tolerance in fpcmp and just run the default as a first approach, then later expand the FP=off compare.

Of course, that only works if the tolerance is low enough on all 50.

  • memory requirements on the device will almost double: added one extra output array, input arrays are not modified, so no need to duplicate them,

Abe's solution wouldn't incur in additional memory consumption, but it would take longer to prepare/compile/run two completely different models of the same benchmark.

cheers,
--renato

sebpop added a comment.Oct 7 2016, 5:17 AM

As I was mentioning, this is intended only for the Polybench, that are following a pretty regular pattern of testing loop kernels.
The patch makes 2mm run correctly for CFLAGS="", "-Ofast", and "-O3 -ffp-contract=on".
I have not looked at the other failing benchmarks to say whether this can apply to those.
I know that oggenc would be impossible to modify like this.

Right. Abe was trying to change the Make/CMake files, which could be a one-size-fits-all solution. If that works, I think we should go with that.

There are also these resource problems I was mentioning:

  • compilation time will double: e.g., Polly will optimize both kernels,
  • compute time on the device will more than double: running the kernel twice, plus an extra loop over both outputs to compare with FP_TOLERANCE.

Those two are true for all solutions we can come up with. 2mm takes 100s on ARM and 3mm takes 150s. Doubling ~50 could make it several minutes longer.

You were proposing getting something running first, then change later, but with a tolerance of 0.0001 after all aggregations, I think we could have FP=on and that tolerance in fpcmp and just run the default as a first approach, then later expand the FP=off compare.

Of course, that only works if the tolerance is low enough on all 50.

I think that the tolerance may be even smaller for the way this patch checks for the outputs in Polybench.
FP_tolerance should be selected (computed as suggested by Stephen) on a per benchmark basis.

  • memory requirements on the device will almost double: added one extra output array, input arrays are not modified, so no need to duplicate them,

Abe's solution wouldn't incur in additional memory consumption, but it would take longer to prepare/compile/run two completely different models of the same benchmark.

Also don't forget Abe's solution has two negative points addressed by this patch:

The good things about this patch:

  • no modifications to CMake and Makefiles (Matthias was complaining about the added complexity to the build system)
  • no extra space to store the extra reference output (1GB extra space and transfer over the network in the case of separate test devices, i.e., ARM)
sebpop added a subscriber: scanon.Oct 7 2016, 5:18 AM

I think that the tolerance may be even smaller for the way this patch checks for the outputs in Polybench.
FP_tolerance should be selected (computed as suggested by Stephen) on a per benchmark basis.

Agreed.

cheers,
--renato

sebpop updated this revision to Diff 74327.Oct 11 2016, 11:16 PM
sebpop edited edge metadata.

Complete patch implementing Proposal 2 for Polybench.

There are 5 benchmarks that need more attention: they require different FP_ABSTOLERANCE and different initial values to be able to compare "-O0 -ffp-contract=off" to "-Ofast":

polybench/linear-algebra/kernels/symm, FP_ABSTOLERANCE=10
polybench/linear-algebra/solvers/gramschmidt, FP_ABSTOLERANCE=1
polybench/medley/reg_detect, FP_ABSTOLERANCE=1e4
polybench/stencils/adi, FP_ABSTOLERANCE=1e4
polybench/stencils/seidel-2d, FP_ABSTOLERANCE=1e-5

sebpop updated this revision to Diff 74754.Oct 14 2016, 5:01 PM
sebpop edited edge metadata.

Updated patch fixes problems in adi and reg_detect.
All polybench pass with -O3 -ffp-contract=on and off on x86_64-linux.
Ok to commit?

MatzeB accepted this revision.Oct 14 2016, 5:12 PM
MatzeB edited edge metadata.

LGTM, thanks for fixing this properly. I leave the final approval to Renato.

SingleSource/Benchmarks/Polybench/datamining/correlation/correlation.c
137 ↗(On Diff #74754)

you actually have to write more to use this macro than to just use sqrt(x[j]) in the first place.

LGTM, thanks for fixing this properly. I leave the final approval to Renato.

I will wait for Renato's approval before commit.

SingleSource/Benchmarks/Polybench/datamining/correlation/correlation.c
137 ↗(On Diff #74754)

I have not changed the code of the kernel.
I intentionally copied the full text without any change from my side in order to avoid any possible error.
If you think there is an error with this line, let's fix that in a separate patch.

rengolin requested changes to this revision.Oct 16 2016, 6:45 AM
rengolin edited edge metadata.

Why are you setting the tolerance AND disabling FP contract, which I think we agreed it wasn't the right way?

Also, the attribute optnone will make sure none of the O2, O3 passes will make any difference, and will confuse the hell out of people trying to benchmark those tests.

You're adding the pragma to the strict loop AND the non-strict loop, what's the point? If both have contract disable, than obviously the compare will always work, regardless of the compiler flags the user sets.

From what I can see you've done everything we agreed should not be done, so I'm not sure what this patch is about...

cheers,
--renato

This revision now requires changes to proceed.Oct 16 2016, 6:45 AM
rengolin added inline comments.Oct 16 2016, 6:47 AM
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/doitgen/doitgen.c
28 ↗(On Diff #74754)

Maybe I understood it wrong, is this pragma's lifetime ended by the end of the function?

If so, then the remaining functions will not have the contract on, and the attribute will also remain in this context.

If that's so, than I think it should be fine.

rengolin added inline comments.Oct 16 2016, 6:54 AM
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/doitgen/doitgen.c
28 ↗(On Diff #74754)

It seems that the interpretation, ignore me, it's been a long week...

rengolin accepted this revision.Oct 16 2016, 6:55 AM
rengolin edited edge metadata.
This revision is now accepted and ready to land.Oct 16 2016, 6:55 AM
This revision was automatically updated to reflect the committed changes.