This is an archive of the discontinued LLVM Phabricator instance.

Update the Polybench tests to check relative error

Authored by andrew.w.kaylor on Jun 25 2021, 11:17 AM.



This change updates the Polybench tests to check results using relative error rather than absolute error and adds an option to have different tolerances depending on whether fast-math and fp-contract are enabled.

(Note: In the first draft, I'm only updating one test in order to get feedback.)

Diff Detail

Event Timeline

andrew.w.kaylor requested review of this revision. Jun 25 2021, 11:17 AM
andrew.w.kaylor set the repository for this revision to rT test-suite. Jun 25 2021, 1:14 PM
rengolin added inline comments. Jun 25 2021, 2:04 PM

Passing comment: this adds a (variably) slow fmul to the existing fsub and fabs, which can more than double the execution time in this piece of code. If we have enough runtime on the actual benchmark, it may not be a big problem, though.

rengolin added a comment. (Edited) Jun 27 2021, 3:59 AM

Hi Andrew,

Thanks for the RFC, I think this is a good discussion to have.

I think the CMake side makes sense, because we may want different precision checks depending on how we run. Even in production systems we may accept imprecision (or increased precision) for different reasons (including performance).

The actual check change is less clear.

True, fabs(a-b) is problematic. It doesn't work well for very small or very large numbers, it ignores differences relative to the magnitude of the values, and we need to add an EPSILON for every test, sometimes more than one. But much of that is also true for normalising the values, as some tests need to be more precise than others anyway.

We're also still not handling corner cases that, for example, Boost does. I'm not proposing we use Boost, but perhaps we could copy some of the logic from there and create a small library that the tests include and use to do the comparisons.

Then, all the tests have to know is which values they should check and what is the appropriate epsilon for each one. We can even make the default epsilon to be proportional to one of the values, say eps = (a * 1e-5) then do fabs(a - b) < eps (or whatever covers more corner cases in the FP-world) and let users override the value of eps only if necessary.

Makes sense?

I think having a more robust library for FP comparisons in the test suite would be great. In fact, I thought about moving even the simple comparison I did in this patch to some common location where it could be included by other tests. Maybe that's a good starting point for an incremental change in the direction you're suggesting. A robust library of such checks is beyond the scope of what I have time to do at the moment, but I would certainly support that direction.

In this case (atax) and I think probably for the rest of the polybench tests, the relative error check I added is probably good enough. My main concern is the case where one of the values is zero. If we've enabled FTZ for the test and the values are approaching the denormal range this could cause a very small value to compare as out-of-tolerance with a value that got flushed to zero. I suppose we can deal with that problem if/when it happens.

In the interest of full disclosure, the FP_TOLERANCE value I used for the atax change was determined experimentally from the actual deltas introduced by enabling FMA for the test. It's shorthand for (2.0*DBL_EPSILON), so I think it's a very reasonable tolerance, maybe even a bit conservative. I don't know if it would be enough for other architectures or library-related errors. If not, it could probably be relaxed without reducing the value of the test.


I'm not sure how this test actually measures performance. The code includes statements (polybench_start_instruments, polybench_stop_instruments) that seem to be intended to isolate the optimized kernel for performance measurements, but I am skeptical that this is the measurement that actually shows up in the LNT comparison.

In any event, the test is running the kernel twice in addition to doing the comparison. I would hope that the kernel calculation overshadows the FP comparison. It's still less than ideal, but what we'd end up with for total run time is 'precise kernel time' + 'optimized kernel time' + 'FP comparison time'. Most relevant performance changes would probably affect both kernel calculations, so the benchmark isn't entirely useless. Any FP-specific performance changes might be masked somewhat by the double run.

I think the best way around this is to have a bot testing for performance that doesn't check the FP values at all and another bot checking for correctness. Of course, I don't know who will be able to supply the machines for this.

rengolin accepted this revision. Jun 30 2021, 8:07 AM

I agree a generic comparison library is outside the scope of this patch.

Getting machines to run the test-suite like the bots, for less popular architectures, usually means taking one of the bots offline and running there. This is obviously tricky.

Just like with the previous test-suite patches, and our general policy, if you test on one architecture it should be fine to commit and monitor breakages in other arches.



Ah, yes, you're right. We used to just time those but that must have changed since last time I looked at them seriously.

This revision is now accepted and ready to land. Jun 30 2021, 8:07 AM