This is an archive of the discontinued LLVM Phabricator instance.

[llvm-exegesis] Let Counter returns up to 16 entries.
ClosedPublic

Authored by oontvoo on Jun 2 2020, 8:17 PM.

Details

Reviewers
ondrasej
courbet
Summary

LBR contains (up to) 16 entries for last x branches and the X86LBRCounter (from D77422) should be able to return all those.
Currently, it just returns the latest entry, which could lead to mis-leading measurements.
This patch aslo changes the LatencyBenchmarkRunner to accommodate multi-value readings.

Diff Detail

Event Timeline

oontvoo created this revision.Jun 2 2020, 8:17 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 2 2020, 8:17 PM
courbet added inline comments.Jun 2 2020, 11:29 PM
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
71

Please get rif of the magic number. Why 16 ? what about Counter->numValues() ?

109–113

Please split this off to void accumulateCounterValues(ValueOrError.get(), CounterValues)

110

[style]

if (!ValueOrError)
  return ValueOrError.takeError();
...
116

Why not += ?

llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp
33

ComputeVariance ?

43

This is missing a square.

66

So computing the min and stddev across values in runAndMeasureMulti() requires them to be measuring the same thing. From the documentation it's not clear to me what the values represent. Are they always homogeneous ? Can you give an example of what they are in the LBR case.

73

why not std::numeric_limits<double> ?

80

Technically you're computing a variance.

81

Because of short-circuiting, if WithMinStdev.empty(), then CurStdev will not be evaluated, and CurStdev will be zero. Then Stdev will be set to 0, preventing any further updates.

82

Was this supposed to be the other way around ? Here we are selecting the largest stddev.

llvm/tools/llvm-exegesis/lib/PerfHelper.cpp
137–139

This returns a vector with Count elements set to zero. If this passes all tests then we are clearly missing tests :(

This looks good in general, but we should be careful about aggregating the values from the measurements (and aggregation when the counter returns multiple values). In particular for the LBR, we'd be losing interesting and potentially useful information by aggregating all the values into a single number.

llvm/tools/llvm-exegesis/lib/BenchmarkRunner.h
72

This should be llvm::SmallVector<int64_t, N> with some small N (e.g. 4) to avoid memory allocations when there is just one return value. This is the case for virtually all counters except the LBR.

73

Consider calling this runAndSample().

llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp
38

You could do
const double Sum = std::accumulate(Values.begin(), Values.end(), 0.0);
instead.

98

I'd prefer to have more flexibility about the numbers that are returned and reported by the tool. The original code collapsed the measurements to a single value (the minimum). This is useful when looking for lower bounds/optimistic numbers, but other processing would also make sense:

  • the mean of the measurements - this might give a better idea of the actual performance when running in the loop.
  • the list of all measured values - so that the user can analyze the distribution of the measurements by themselves (and go beyond the mean/variance).

In particular for the LBR measurements: being able to see the raw timings of individual loop iterations is what makes that measurement method so appealing, and we'd lose all that information if we just took an aggregate value, be it a min or a mean. The min and mean from the LBR do have their value and I'd expect them to be more precise than the measurements over an unrolled loop, so ideally we'd have a way to see both.

I think a good solution would be to add an argument to this method that determines how the values are aggregated over the measurements:

  • min,
  • mean,
  • min variance,
  • keep all.

    and over the contents of the returned vector:
  • min,
  • mean,
  • keep all.

This will mean more changes up the call stack, but it would give us a lot more flexibility in using the tool. What do you think?

llvm/tools/llvm-exegesis/lib/PerfHelper.cpp
121–122

You might want to check that there is at least one element.

131

This should be llvm::SmallVector<int64_t, some small N> - most counters return just a single value.

oontvoo marked an inline comment as done.Jun 3 2020, 6:39 AM
oontvoo added inline comments.
llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp
98

Yes, that's a good idea. I've thought about this a bit more and realised even keeping min-variance (of each read) could still be completely wrong.

Imagine your benchmarked code has a number of distinct branches; then there is no reason to expect the cycles from these branches to correlate. The min-variance approach is only meaningful if it's the same code-path.

oontvoo marked an inline comment as not done.Jun 3 2020, 6:46 AM
oontvoo updated this revision to Diff 270029.Jun 10 2020, 9:34 PM
oontvoo marked 19 inline comments as done.

Updated diff

llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp
66

Yes, they're supposed to be homogenous, representing the measurements of the same block sampled at fixed rate.

For the LBR specifically:

  • Each value in the vector is the number of elapsed cycles since last branch-retire
  • When we take a sample, we have 16 of such values, representing the last 16 branches.
  • If we have a loop whose body is a basic-block, then effectively these are the measurements for the last 16 iterations[0]. if the loop body has some branches then we'd need a different aggregation strategy.

[0] I think this is where it's a bit "wrong" right now. We always only read the last 16 branches. What we probably want is to be able to have the BenchmarkRunner pause at fixed period and take a sample.

80

¯\_(ツ)_/¯ indeed!

oontvoo updated this revision to Diff 270129.Jun 11 2020, 7:09 AM

Fix clang-tidy issues

I've added a couple of style comment + one bigger comment on the aggregation of results from multiple runs/counter buffer.

llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
56

NewValues should be a const reference, no?

65

This is probably not a big deal, but consider:

const size_t NumValues = std::max(NewValues.size(), Result->size());
if (NumValues > Result->size()) Result->resize(NumValues, 0);
for (size_t i = 0, End = NewValues.size(); i < End; ++i) {
  (*Result)[i] += NewValues[i];
}

for unrolled + vectorized code.

71

s/reserved/Reserved/

85

I'd prefer not using assignment in function arguments - this might be easily overlooked or mistaken for a bug (assignment in place of equality comparison).

llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp
41

std::pow is somewhat heavy-weight as it is not optimized for integer exponents,
const double Delta = V - Mean;
Ret += Delta * Delta;
will likely be significantly faster.

53

A shorter way to write this:

return std::accumulate(
  Values.begin(), Values.end(), std::numeric_limits<int64_t>::max(),
  [](int64_t A, int64_t B) { return A < B ? A : B; });

And similar for FindMax below.

84

Nit: Since we have WithMinVariance, consider renaming Variance to MinVariance, and CurVariance to just Variance.

85

Ideally, this should use move semantics (if they are supported by SmallVector).

86

This would update Variance (MinVariance with the rename proposed above) even if the variance increased. What you probably want is

if (Variance < MinVariance) {
  WithMinVariance = *ExpectedCounterValues;
  MinVariancec = Variance;
}
106

I'd still split this into two arguments:
One that decides what happens with the return values of runAndSample() with {Concatenate, MinVariance}.

And another one that decides what to do with the results of the previous step (Min, Max, Mean, return as is}.

With the current MinVariance filtering in all cases, we're changing the behavior for scalar counters, where we might drop some values (and we might drop some values also in the LBR case).

llvm/tools/llvm-exegesis/lib/PerfHelper.cpp
155

This should also be llvm::SmallVector (please test with HAVE_LIBPFM undefined/false).

oontvoo updated this revision to Diff 270164.Jun 11 2020, 9:14 AM

Fixed another clang-tidy issue

oontvoo updated this revision to Diff 270202.Jun 11 2020, 12:08 PM
oontvoo marked 13 inline comments as done.

Updated diff

llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp
106

Ah, I think we can infer how to accumulate the values returnt by runAndSample().

  • If the return vector has more than 1 element, then keep the set with min variance (because it wouldn't make sense to concat all the values from different runs together)
  • If the vector only has 1 element, then concat them together to find Min/Max/Mean

One last comment, but otherwise this looks good. I'll leave the approval to Clement.

llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp
73

Very nit: if we're working with doubles, std::numeric_limits<double>::infinity() might be even better.

oontvoo updated this revision to Diff 271816.Jun 18 2020, 12:58 PM

std::numeric_limits<double>::infinity()

oontvoo marked an inline comment as done.Jun 18 2020, 1:00 PM
courbet added inline comments.Jun 19 2020, 5:28 AM
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
62

[style] no braces

llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp
53

Why not std::min_element ?

llvm/tools/llvm-exegesis/lib/Target.cpp
74

Let's do this now to avoid being in an inconsistent state.

oontvoo updated this revision to Diff 273211.Jun 24 2020, 6:14 PM
oontvoo marked 4 inline comments as done.

update diff

courbet added inline comments.Jun 25 2020, 12:26 AM
llvm/tools/llvm-exegesis/lib/Target.cpp
121

I the default Target implementation does not use it, let's just do:

std::unique_ptr<BenchmarkRunner> ExegesisTarget::createUopsBenchmarkRunner(
    const LLVMState &State,
    InstructionBenchmark::ResultAggregationModeE /*unused*/) const {
  return std::make_unique<UopsBenchmarkRunner>(State);
}
oontvoo marked an inline comment as done.Jun 25 2020, 7:14 AM
oontvoo added inline comments.
llvm/tools/llvm-exegesis/lib/Target.cpp
121

Should we change the impl to use it?

oontvoo updated this revision to Diff 273373.Jun 25 2020, 8:13 AM
oontvoo marked an inline comment as done.

Removed unused ResultAggMode

oontvoo updated this revision to Diff 273374.Jun 25 2020, 8:15 AM

update diff

courbet accepted this revision.Jun 26 2020, 12:27 AM
This revision is now accepted and ready to land.Jun 26 2020, 12:27 AM
oontvoo closed this revision.Jun 26 2020, 8:50 AM