The clang review https://reviews.llvm.org/D74436 changes the default setting for ffp-contract to true, and that causes about 20 test-suite LNT tests to fail on X86. This patch adds some pragma to suppress FP_CONTRACT and allows the tests to pass. In llvm-dev email conversation @jdoerfert said this kind of modification would be OK.
Details
- Reviewers
andrew.w.kaylor kbsmith1 jdoerfert fhahn kristof.beyls sebpop rengolin - Commits
- rT24550c3385e8: [test-suite] Suppress FP_CONTRACT due to command line changes planned in D74436
rTa52cf9f13240: [test-suite] Suppress FP_CONTRACT due to command line changes planned in D74436
rTccc86f938839: [test-suite] Suppress FP_CONTRACT due to command line changes planned in D74436
Diff Detail
- Repository
- rT test-suite
Event Timeline
I think this is generally reasonable, haven't looked at all benchmarks and I would prefer some other opinions.
SingleSource/Benchmarks/Polybench/stencils/adi/adi.c | ||
---|---|---|
74 | I'd always put it at the function entry, consistency and easier to read. |
OK I can change that. In some of the files I put the pragma at file scope. It would take more time to find all the functions that are affected.
This seems OK as a short term solution, but it is also problematic in that it prevents FMA from being used in these performance measurements. I understand that it's not trivial to allow FP-related error tolerance in the test, but (as a future patch) it would be nice to at least have a way to turn off the exact result checking on FP tests so that the performance could be measured with fast-math and fp-contract enabled, and that would require having a way to re-enable FMA for these tests.
Would it be better to add "-ffp-contract=off" to the test build lines for these cases (and any others that are already using the pragma)?
Thanks @andrew.w.kaylor. There are about 70 instances of the pragma in the test-suite, I believe most or all were added in https://reviews.llvm.org/D25346 ; I can create another revision of this patch which suppresses for the entire test using the command line option.
This revision of the patch adds ffp-contract=off into the build line for the 20 tests which are failing on X86 due to the dependent clang patch.
This is fine long-term for applications, but some groups monitor these benchmark numbers. I'm adding @kristof.beyls but I'm sure there are more.
We have done this kind of change before, but with the promise that whoever did this will work on bringing them back up. Hoping that someone else will do that later leads to no one doing it.
I'd add a big FIXME comment on each test that runs in benchmark mode explaining that this is temporary and relies on a specific clang change.
Worst case scenario, someone else stumbles upon the slow down, sees FMAs not being generated and finds the flags in the CMake files. The comment will immediately lead back to the clang change.
That review was carefully crafted by @sebpop to give us the numbers we need without breaking the reporting and it took a while to get right. I'd like to continue having that info long term.
But this change is more than just about Polybench. I'm unsure how this will interact with your current change, or even a whole test approach.
Looking further, the Clang review doesn't seem temporary, so I'm not sure what the change to remove the checks added here will be, "long term".
If you look at what @sebpop did, it was a complex dance to get performance numbers and also accuracy checks. If we disable it now, we'll need similar work done for all other benchmarks as well as fixing polybench.
The more I read this the more uncomfortable I get... I think we need to take a step back and understand what the problem is and how to fix it, rather than patch the tests when they fail...
@rengolin
Thanks I appreciate the issues you raise.
Some of the test-suite benchmarks were modified in D25277 adding the pragma to suppress FP_CONTRACT. Since the pragma overrides the command line option, contraction will always be suppressed in the affected region of source code, regardless if ffp-contract=on on the command line.
Can you suggest a way forward? The clang patch was stalled months ago due to this test-suite, but many floating point experts opined that the clang change is deisreable.
Pragmas are specific to a piece of code that may be problematic. This is preferable than a catch-all CXXFLAGS option.
My opinion in the Polybench discussion is that the LLVM test-suite isn't a good fit for running benchmark numbers because it's both a test-suite and a benchmark-suite.
For tests, we must have reproducible and stable numbers, while for benchmarks we must have the highest level of optimisation we can get.
Do we need to check that the output is "reasonable" in benchmarks? Or do we leave broken codegen and assumptions for other tests?
Do we really need Polybench to run twice? Or do we have Polybench in the test-suite as a "test" and allow people to plug their own (updated) Polybench as an external benchmark?
I have always preferred to have the test-suite as just a test suite, and have benchmarks as external programs, because everyone runs differently.
Can you suggest a way forward? The clang patch was stalled months ago due to this test-suite, but many floating point experts opined that the clang change is deisreable.
I can see a few solutions:
- Add a pragma to the places that break stuff with a massive FIXME comment above, if there's only one place in the test where it breaks.
- Keep the CXXFLAGS changes for the other tests if there would be too many pragmas. But remember to update the comment lines as they're incomplete ("due to...?").
- Disable the fp-contract run of Polybench altogether, reverting @sebpop's change. This is preferable than running twice, none with fp-contract.
The final solution is probably a mix of them.
Is that helpful?
cheers,
--renato
https://reviews.llvm.org/D74436 is a very important change: thanks @mibintc for working on this!
https://reviews.llvm.org/D25277 and other changes to LLVM test-suite was motivated by a similar change in Clang to make the default behavior similar to what GCC does for fp contraction.
I agree. I'm just trying to avoid running tests that don't do anything. If we're not going to have fp-contract on Polybench, then we shouldn't be running it twice, just roll back those changes and suppress fp-contract on them.
To be clear, I'm happy with the flags being there, I'm just not happy with blindly applying them without disabling the second Polybench run.
and
- Disable the fp-contract run of Polybench altogether, reverting @sebpop's change. This is preferable than running twice, none with fp-contract.
I agree with you Renato: polybench should be run once.
I would recommend to run polybench without checking for correctness issues.
The original purpose of polybench is to be a performance benchmark.
The original code of polybench did not contain correctness checks.
The correctness checks have been added by LLVM developers, and checking for correctness is a hard problem for FP.
Right, and this is my main point. It may be best if we don't run Polybench at all, and let people connect it to external benchmarks, like SPEC.
The test-suite is a "test" "suite". We initially added benchmarks because we wanted to make sure we don't break them (ie. correctness).
Many years ago, @kristof.beyls kick started an effort to add a benchmark mode and that was largely successful, but it added one constraint we didn't have before: the programs had to be both right and fast.
The way we "fixed" Polybench was to add two runs, but that's not perfect, as now we have two programs running from the same sources.
The benchmark mode used to be very useful, Arm checked it frequently and fixed performance bugs from those. I'm sure other groups did the same. I'm not sure how useful it is today, but I'd be sad to see it go.
However, the test-suite is constantly blocking progress in the compiler with its rigid fractal structure and, more frequent than not, we're opting for heavy handed "fixes" to it.
So, my short-term proposal is:
- Disable the second run of Polybench, force fp-contract=off for it's remaining run
- Apply the flags to the remaining tests (as is in this review)
My long-term proposal is:
- Only allow programs (including benchmarks) inside the test-suite if and only if they can be tested (ie. stable output)
- Apply whatever changes needed to make the output stable without breaking the original premise of the programs/benchmarks
- Never run those programs as benchmarks, only as tests
- Improve the external part of the test-suite and move all existing benchmarks there (perhaps a different repo?)
- Make benchmark mode run only external benchmarks (official and local/custom ones)
- Never mix tests with benchmarks again
In response to the review comments, I'm updating the patch like this. It's not complete just showing you the idea.
In addition to adding -ffp-contract=off to the failing tests, add -DFMA_DISABLED=1
Then in the test itself, if FMA is disabled, don't do the duplicate run of the kernel.
What do you think?
I'm ok with that, for now. Medium term, it would be best to just remove the second run, but I don't mind it not being in this patch.
In Phab, I can't remove my rejection without approving, so I'm doing that to show that I'm ok with any solution in this space.
But please get @sebpop's approval before committing, to make sure what we get is the cleanest solution.
Thanks!
Here's the patch with ffp-contract=off for the 20 tests that failed with X86 and modifications to the test source to suppress the 2nd kernel execution #if !FMA_DISABLED
When I execute the tests using llvm-lit with just ffp-contract=off, and not trying to avoid the 2nd kernel execution, the tests pass. However when I execute the tests and skip the 2nd kernel execution then 11 tests fail. I assume this is because the hash value needs to be regenerated. What's the procedure for doing that?
These are the tests that fail. The adi test fails comparing the hash value, I didn't check the others yet but I assume it's the same.
Failed Tests (11):
test-suite :: SingleSource/Benchmarks/Polybench/datamining/correlation/correlation.test test-suite :: SingleSource/Benchmarks/Polybench/datamining/covariance/covariance.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/atax/atax.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/bicg/bicg.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemver/gemver.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gesummv/gesummv.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trisolv/trisolv.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trmm/trmm.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/solvers/gramschmidt/gramschmidt.test test-suite :: SingleSource/Benchmarks/Polybench/stencils/adi/adi.test
Testing Time: 861.18s
Passed: 305 Failed: 11
Perhaps it would be easier to just remove the part of the code that runs twice? Probably also need to remove the output files and maybe some CMake magic that compares them?
I corrected the test output and the tests are passing now. This patch defines FMA_DISABLED=1 and ffp-contract=off for the 20 polybench test cases that are failing with D74436 applied. The remaining polybench test cases are run without those additional options, and so the additional kernel invocation with StrictFP does occur. These are the 20 tests
SingleSource/Benchmarks/Linpack
SingleSource/Benchmarks/Misc-C++/Large
SingleSource/Benchmarks/Polybench/datamining/covariance
SingleSource/Benchmarks/Polybench/datamining/correlation
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trisolv
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemver
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gesummv
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/bicg
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trmm
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/atax
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/gramschmidt
SingleSource/Benchmarks/Polybench/stencils/adi
MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR
MultiSource/Benchmarks/DOE-ProxyApps-C++/HPCCG
MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE
MultiSource/Benchmarks/Rodinia/srad
MultiSource/Applications/oggenc
MicroBenchmarks/ImageProcessing/BilateralFiltering
MicroBenchmarks/ImageProcessing/Blur
@sebpop Can you please take a look?
I plan to push this in a couple days. Do the LNT bots pull the source from latest or is a special action needed to feed the updated tests into the bots?
When I committed the clang patch, the same tests are also failing on arm. I'd like to disable contract on those tests unconditionally not just architecture specific
Plus I'm not sure if my architecture test is working right because my patch didn't "work" for broadwell. Confused: is my patch incorrect for x86 or my patch didn't get to the bot?
Here's the list of test fails for arm:
Failed Tests (22):
test-suite :: MicroBenchmarks/ImageProcessing/BilateralFiltering/BilateralFilter.test test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test test-suite :: MultiSource/Applications/oggenc/oggenc.test test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/HPCCG/HPCCG.test test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test test-suite :: MultiSource/Benchmarks/Rodinia/srad/srad.test test-suite :: SingleSource/Benchmarks/Linpack/linpack-pc.test test-suite :: SingleSource/Benchmarks/Misc-C++/Large/sphereflake.test test-suite :: SingleSource/Benchmarks/Polybench/datamining/correlation/correlation.test test-suite :: SingleSource/Benchmarks/Polybench/datamining/covariance/covariance.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/atax/atax.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/bicg/bicg.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemver/gemver.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gesummv/gesummv.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trisolv/trisolv.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trmm/trmm.test test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/solvers/gramschmidt/gramschmidt.test test-suite :: SingleSource/Benchmarks/Polybench/stencils/adi/adi.test
No I'm wrong, the LNT bot was updated with my patch, but the Broadwell tests failed for execution_time versus the previous run where they failed for hash code not matching.
Other targets, like arm and s390 failed today due to the clang patch because the hash code didn't match. I can make this change which applies generally not target-specific.
I don't know yet how to address the execution_time fails.
I will submit an updated patch which unconditionally adds ffp-contract=off to the failing tests (versus checking ARCH)
After this change, building the LLVM test-suite for X86_64h seems to fail http://green.lab.llvm.org/green/job/test-suite-verify-machineinstrs-x86_64h-O3/9586/ :
FAILED: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/CMakeFiles/2mm.dir/2mm.c.o /Users/buildslave/jenkins/workspace/test-suite-verify-machineinstrs-x86_64h-O3/test-suite-build/tools/timeit --summary SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/CMakeFiles/2mm.dir/2mm.c.o.time /Users/buildslave/jenkins/workspace/test-suite-verify-machineinstrs-x86_64h-O3/compiler/bin/clang -DFP_ABSTOLERANCE=1e-5 -DNDEBUG -B /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin -Wno-unused-command-line-argument -mllvm -verify-machineinstrs -O3 -arch x86_64h -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk -w -Werror=date-time -I /Users/buildslave/jenkins/workspace/test-suite-verify-machineinstrs-x86_64h-O3/test-suite/SingleSource/Benchmarks/Polybench/utilities -DPOLYBENCH_DUMP_ARRAYS -ffp-contract=off -DFMA_DISABLED=1 -ffp-contract=off -DFMA_DISABLED=1 -MD -MT SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/CMakeFiles/2mm.dir/2mm.c.o -MF SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/CMakeFiles/2mm.dir/2mm.c.o.d -o SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/CMakeFiles/2mm.dir/2mm.c.o -c /Users/buildslave/jenkins/workspace/test-suite-verify-machineinstrs-x86_64h-O3/test-suite/SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm.c /Users/buildslave/jenkins/workspace/test-suite-verify-machineinstrs-x86_64h-O3/test-suite/SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm.c:195:31: error: use of undeclared identifier 'D_StrictFP' 2mm.c error: use of undeclared identifier 'D_StrictFP'
The previous revision had a compilation error in 2mm.c I've corrected that.
This patch affects multiple architectures. Is there a way to submit this patch for testing before pushing to the trunk?
When I use cmake on test-suite, the Makefile created doesn't create a build for 2mm, 3mm or gemm. I don't know why. I verified by hand that those 3 files compile OK.
Not automatic. The best you can do is to ask the maintainers of the architecture bots to test that for you.
I can resurrect some Armv7/v8 machines at home and try, but if @maxim-kuvyrkov could run it on Linaro servers, that'd be much faster.
When I use cmake on test-suite, the Makefile created doesn't create a build for 2mm, 3mm or gemm. I don't know why. I verified by hand that those 3 files compile OK.
Hm, CMake is the default way of building the test-suite for a number of years, but this could have been an oversight of the migration?
Honestly, I haven't built the test-suite for a long time, so I'm not sure what's the current status.
Turns out my boards are too old and need some love that I'm unable to give them.
If you can't find anyone that can help you, an alternative is to merge and lookout for breakages, and if you only see one of two, (revert and) liaise with the bot owners to fix them.
If you have too many errors, or errors in the build and non-test-suite tests, then it's time to go back to the drawing board. :(
It's not the nicest way to handle it but it works if nothing else does.
Fixed an initialization bug in 2mm.c
BTW the reason that 2mm 3mm and gemm didn't build is because I was using -DTEST_SUITE_BENCHMARKING_ONLY=true on the cmake invocation, so those 3 test directories were not included.
Ah, yes. You need to test both test/benchmark modes, as we have bots for both modes on at least x86_64, arm and aarch64.
Thanks for looking
If you can't find anyone that can help you, an alternative is to merge and lookout for breakages, and if you only see one of two, (revert and) liaise with the bot owners to fix them.
I plan to try pushing again when I have a block of time to babysit the bots. Is the place I should watch on Green Dragon? e.g. https://green.lab.llvm.org/green/job/test-suite-verify-machineinstrs-x86_64h-O3/
I can see there's aarch64 and x86 but what about other arch?
If you have too many errors, or errors in the build and non-test-suite tests, then it's time to go back to the drawing board. :(
It's not the nicest way to handle it but it works if nothing else does.
That, and https://lab.llvm.org/buildbot/#/console
I don't know anymore which bots build the test-suite. But you'll get an email if you break anything, so just make sure none of them are related to your patch and you should be fine.
About this specific point: the main requirement is to keep the "noisy" bots green.
You're not required to know of all bots that build "componentY" on "archX", just be alert for emails and fix them as they come.
If there is no "archX" bot building the test-suite, then you don't need to worry about archX when changing the test-suite.
I'd always put it at the function entry, consistency and easier to read.