LBR contains (up to) 16 entries for last x branches and the X86LBRCounter (from D77422) should be able to return all those.
Currently, it just returns the latest entry, which could lead to mis-leading measurements.
This patch aslo changes the LatencyBenchmarkRunner to accommodate multi-value readings.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp | ||
---|---|---|
60 | Please get rif of the magic number. Why 16 ? what about Counter->numValues() ? | |
87 | Please split this off to void accumulateCounterValues(ValueOrError.get(), CounterValues) | |
88 | [style] if (!ValueOrError) return ValueOrError.takeError(); ... | |
94 | Why not += ? | |
llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp | ||
27 | ComputeVariance ? | |
37 | This is missing a square. | |
42 | So computing the min and stddev across values in runAndMeasureMulti() requires them to be measuring the same thing. From the documentation it's not clear to me what the values represent. Are they always homogeneous ? Can you give an example of what they are in the LBR case. | |
48 | why not std::numeric_limits<double> ? | |
56 | Technically you're computing a variance. | |
57 | Because of short-circuiting, if WithMinStdev.empty(), then CurStdev will not be evaluated, and CurStdev will be zero. Then Stdev will be set to 0, preventing any further updates. | |
58 | Was this supposed to be the other way around ? Here we are selecting the largest stddev. | |
llvm/tools/llvm-exegesis/lib/PerfHelper.cpp | ||
136 | This returns a vector with Count elements set to zero. If this passes all tests then we are clearly missing tests :( |
This looks good in general, but we should be careful about aggregating the values from the measurements (and aggregation when the counter returns multiple values). In particular for the LBR, we'd be losing interesting and potentially useful information by aggregating all the values into a single number.
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.h | ||
---|---|---|
71 | This should be llvm::SmallVector<int64_t, N> with some small N (e.g. 4) to avoid memory allocations when there is just one return value. This is the case for virtually all counters except the LBR. | |
72 | Consider calling this runAndSample(). | |
llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp | ||
32 | You could do | |
68 | I'd prefer to have more flexibility about the numbers that are returned and reported by the tool. The original code collapsed the measurements to a single value (the minimum). This is useful when looking for lower bounds/optimistic numbers, but other processing would also make sense:
In particular for the LBR measurements: being able to see the raw timings of individual loop iterations is what makes that measurement method so appealing, and we'd lose all that information if we just took an aggregate value, be it a min or a mean. The min and mean from the LBR do have their value and I'd expect them to be more precise than the measurements over an unrolled loop, so ideally we'd have a way to see both. I think a good solution would be to add an argument to this method that determines how the values are aggregated over the measurements:
This will mean more changes up the call stack, but it would give us a lot more flexibility in using the tool. What do you think? | |
llvm/tools/llvm-exegesis/lib/PerfHelper.cpp | ||
123 | You might want to check that there is at least one element. | |
129 | This should be llvm::SmallVector<int64_t, some small N> - most counters return just a single value. |
llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp | ||
---|---|---|
68 | Yes, that's a good idea. I've thought about this a bit more and realised even keeping min-variance (of each read) could still be completely wrong. Imagine your benchmarked code has a number of distinct branches; then there is no reason to expect the cycles from these branches to correlate. The min-variance approach is only meaningful if it's the same code-path. |
Updated diff
llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp | ||
---|---|---|
42 | Yes, they're supposed to be homogenous, representing the measurements of the same block sampled at fixed rate. For the LBR specifically:
[0] I think this is where it's a bit "wrong" right now. We always only read the last 16 branches. What we probably want is to be able to have the BenchmarkRunner pause at fixed period and take a sample. | |
56 | ¯\_(ツ)_/¯ indeed! |
I've added a couple of style comment + one bigger comment on the aggregation of results from multiple runs/counter buffer.
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp | ||
---|---|---|
56 | NewValues should be a const reference, no? | |
60 | s/reserved/Reserved/ | |
65 | This is probably not a big deal, but consider: const size_t NumValues = std::max(NewValues.size(), Result->size()); if (NumValues > Result->size()) Result->resize(NumValues, 0); for (size_t i = 0, End = NewValues.size(); i < End; ++i) { (*Result)[i] += NewValues[i]; } for unrolled + vectorized code. | |
74 | I'd prefer not using assignment in function arguments - this might be easily overlooked or mistaken for a bug (assignment in place of equality comparison). | |
llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp | ||
35 | std::pow is somewhat heavy-weight as it is not optimized for integer exponents, | |
47 | A shorter way to write this: return std::accumulate( Values.begin(), Values.end(), std::numeric_limits<int64_t>::max(), [](int64_t A, int64_t B) { return A < B ? A : B; }); And similar for FindMax below. | |
60 | Nit: Since we have WithMinVariance, consider renaming Variance to MinVariance, and CurVariance to just Variance. | |
61 | Ideally, this should use move semantics (if they are supported by SmallVector). | |
62 | This would update Variance (MinVariance with the rename proposed above) even if the variance increased. What you probably want is if (Variance < MinVariance) { WithMinVariance = *ExpectedCounterValues; MinVariancec = Variance; } | |
82 | I'd still split this into two arguments: And another one that decides what to do with the results of the previous step (Min, Max, Mean, return as is}. With the current MinVariance filtering in all cases, we're changing the behavior for scalar counters, where we might drop some values (and we might drop some values also in the LBR case). | |
llvm/tools/llvm-exegesis/lib/PerfHelper.cpp | ||
151 | This should also be llvm::SmallVector (please test with HAVE_LIBPFM undefined/false). |
Updated diff
llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp | ||
---|---|---|
82 | Ah, I think we can infer how to accumulate the values returnt by runAndSample().
|
One last comment, but otherwise this looks good. I'll leave the approval to Clement.
llvm/tools/llvm-exegesis/lib/LatencyBenchmarkRunner.cpp | ||
---|---|---|
48 | Very nit: if we're working with doubles, std::numeric_limits<double>::infinity() might be even better. |
llvm/tools/llvm-exegesis/lib/Target.cpp | ||
---|---|---|
118 ↗ | (On Diff #273211) | I the default Target implementation does not use it, let's just do: std::unique_ptr<BenchmarkRunner> ExegesisTarget::createUopsBenchmarkRunner( const LLVMState &State, InstructionBenchmark::ResultAggregationModeE /*unused*/) const { return std::make_unique<UopsBenchmarkRunner>(State); } |
llvm/tools/llvm-exegesis/lib/Target.cpp | ||
---|---|---|
118 ↗ | (On Diff #273211) | Should we change the impl to use it? |
This should be llvm::SmallVector<int64_t, N> with some small N (e.g. 4) to avoid memory allocations when there is just one return value. This is the case for virtually all counters except the LBR.