Add benchmark mode that uses LBR for more precise measurements.
Diff Detail
Event Timeline
llvm/tools/llvm-exegesis/llvm-exegesis.cpp | ||
---|---|---|
88 ↗ | (On Diff #254993) | Let's move this to the X86 target, and rename it to x86-lbr-sample-period, with a default of 0. If non-zero, we're going to use LBR. Else, we'll use the non-LBR runner. |
Addressed review comments:
- remove the new mode and merge it into latency
- various refactorings and comments-cleanup per courbet@'s requests
llvm/docs/CommandGuide/llvm-exegesis.rst | ||
---|---|---|
210 | Can you expand on what "hiding blocks" means ? e.g. an external reference ? | |
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp | ||
91 | Apart from the PerfEvent, this does not really have anything to do with LBR. So what we can do:
| |
199 | Again, this is too specific to X86. I think it's one of these cases where is actually makes more sense for the generic code to know nothing about LBR. What about:
| |
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.h | ||
42 | The LLVM style is FIXME | |
llvm/tools/llvm-exegesis/lib/X86/Target.cpp | ||
32 | do we really need all these ? | |
598 | [nit] s/PFM package available../HAVE_LIBPFM/ the package might be available but specifically disabled | |
llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp | ||
17 | Does this work under windows ? If not, please use one of the LLVM wrappers. | |
30 | internal functions use a static keyword instead of anonymous namespace in the LLVM style guide. | |
35 | s/time out/timeout | |
39 | static | |
53 | This would benefit from a pointer to documentation (or argued speculation about) whether rewriting code inside a loop for another thread is valid and how it behaves w.r.t. caches and DSB. | |
57 | static | |
84 | static | |
126 | you can remove this now | |
198 | FIXME. Also, please give more explanations so that somebody else can look at that too ;) | |
llvm/tools/llvm-exegesis/lib/X86/X86Counter.h | ||
2 | This file, and the cpp one, are missing the LLVM headers. |
- Make internal helpers static functions (instead of anonymous ns members)
- Reword confusing comments
- Add LLVM headers
llvm/docs/CommandGuide/llvm-exegesis.rst | ||
---|---|---|
199 | The LBR on Haswell has only the addresses of the branches. To get the timing, Skylake is the minimal requirement. | |
207 | I'd add a mention that this applies to the LBR ratency mode. | |
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp | ||
187 | As noted above - Haswell and Broadwell have the LBR, but it does not report the number of cycles between the branches. |
Thanks for the thorough reviews! :-)
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp | ||
---|---|---|
91 | Done - instead of moving the executor out, I've just let the Target decides whether to use WorkerThread executor or otherwise. | |
llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp | ||
17 | Per offline discussion, we'll ifdef this out for non-linux and will add support for windows in follow-up CLs | |
53 | I meant to rewrite this hack to use a condition var for stopping the loop from another thread, per Ondrej's suggestion . So I'll add a FIXME for next CLs | |
198 | Per offline discussions, this requires re-working the read() interface because we need to return more than one value. But because this patch is already quite big , we'll keep this as TODO and follow up. |
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp | ||
---|---|---|
91 | That forces passing the RunArg to the target, which has no reason to know about it. What I'm suggesting is the following:
| |
llvm/tools/llvm-exegesis/lib/X86/Target.cpp | ||
619 | This should really onely be: Expected<std::unique_ptr<pfm::Counter>> createCounter(const char *CounterName) { if (CounterName == kCyclesCounter && LbrSamplingPeriod > 0) { return std::make_unique<X86LbrCounter>(PerfEvent("LBR", LbrSamplingPeriod)); } return ExegesisTarget::createCounter(CounterName); } | |
621 | This will fail to compile under windows because X86LbrCounter does not exist there. | |
llvm/tools/llvm-exegesis/llvm-exegesis.cpp | ||
348 ↗ | (On Diff #256477) | This compile-time guarded runtime check can now go, because the x86 specific concepts are handled in the x86 target. |
Addressed review comments:
- remove no-longer needed RunArg
- move target-specific to X86/Target
Update perf_event_attr to use BR_INST_RETIRED.NEAR_TAKEN per eranian@'s recommendation.
Updated patch to cover what we discussed in the sync last week and this week, specifically:
- Remove the worker-thread executor (not needed. It's fine to use the same thread)
- Make PerfEvent virtual so the X86Counter can create its own PerfEvent with LBR-specific attributes.
Remaining questions:
- Counter should be able to return more than one readings.
- We should validate that repetion-mode=loop for LBR
llvm/docs/CommandGuide/llvm-exegesis.rst | ||
---|---|---|
197 | On X86, | |
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp | ||
13 | you're no longer using <thread> and <mutex>. | |
16 | is this needed ? | |
66 | s/auto/pfm::Counter/ | |
llvm/tools/llvm-exegesis/lib/PerfHelper.h | ||
45 | why is the dtor virtual ? I don't see any virtual methods ? | |
llvm/tools/llvm-exegesis/lib/Target.h | ||
69 | Can you submit this change independently and rebase ? Thanks ! | |
llvm/tools/llvm-exegesis/lib/X86/Target.cpp | ||
593 | && LbrSamplingPeriod > 0 ? | |
llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp | ||
65 | fills | |
184 | [style] if (PollResult == -1) return llvm::make_error<llvm::StringError>("Cannot poll LBR perf event.", llvm::errc::io_error); if (PollResult != 0) continue; ... | |
185 | how often does that happen ? Do we want that much logging ? |
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp | ||
---|---|---|
63 | [style] no braces for 1-line conditional bodies | |
llvm/tools/llvm-exegesis/lib/Target.cpp | ||
36 | [style] no braces for 1-line conditional bodies | |
llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp | ||
88 | [style] no braces for 1-line conditional bodies | |
131 | [style] no braces for 1-line conditional bodies | |
144 | [style] no braces for 1-line conditional bodies | |
206 | [style] no braces for 1-line conditional bodies | |
216 | ditto | |
223 | ditto |
llvm/tools/llvm-exegesis/lib/Target.h | ||
---|---|---|
69 | ||
llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp | ||
185 | It could happen 100% of the time if the code happens to have no branches. |
llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp | ||
---|---|---|
184 | you don't need the else after the return. | |
185 | I'm not worried about the no-branch case as we can avoid that. We might as well remove it, because we don't really care if it fanally succeeded, and if it did not we can print an error there. | |
190 | you don't need the else here after the continue. |
OK, I think this is in good shape now, but we still need a test for it. The only issue is that not all buildbots support LBR, so we need a mechanism for skipping the test when not available. You can create a subdir llvm-project/llvm/test/tools/llvm-exegesis/X86/lbr with a custom lit similar to llvm-project/llvm/test/tools/llvm-exegesis/X86/lit.local.cfg to check lbr support before running the test.
Add test and option to check for expected CPU.
I can't seem to find a way to get the host CPU version (much less name or pmu capabilities) from the lit config, so I had to add the check to llvm-exegesis.
llvm/tools/llvm-exegesis/llvm-exegesis.cpp | ||
---|---|---|
291 ↗ | (On Diff #268037) | This is a bit brittle, because we could imagine the name of some unrelated CPUs being substrings of others. What about having a repeated option --allowed-host-cpu=skylake --allowed-host-cpu=skylake-avx512 --allowed-host-cpu=whateverlake, and check that the exact value is one of these ? |
llvm/docs/CommandGuide/llvm-exegesis.rst | ||
---|---|---|
197 | This is more of a philosophical comment: IIRC, the LBR records the times when the branch instructions are retired. With correct branch prediction, the instructions in the measured block would start executing speculatively before the branch is retired, and so the number we get is probably closer to the throughput. | |
211 | I'm wondering how this applies to this use case. The issue with basic blocks skipped might be when sampling all basic blocks in a larger application over a longer period of time, but this is not our case. | |
llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp | ||
88 | I think we also need to filter by the addresses of the branches (or store them) to make sure that we take the cycles only for the measured basic block, not for the other basic blocks surrounding it. | |
llvm/tools/llvm-exegesis/llvm-exegesis.cpp | ||
291 ↗ | (On Diff #268037) | Ideally, this should be defined in terms of CPU features (e.g. CPUID bits). Even better - each target should know which counters it supports, based on its platform-specific feature discovery mechanism. I understand that this would be a huge change for this CL, but we should at least have a FIXME here. |
llvm/tools/llvm-exegesis/llvm-exegesis.cpp | ||
---|---|---|
291 ↗ | (On Diff #268037) | Actually, I had a question here that I was gonna ask in an email But here goes. We know that the LBR formats could be queried from the perf-cap MSR If I'm not mistaken, perf_event_mmap_page::capabilities should(?) give us that. Thoughts? |
llvm/tools/llvm-exegesis/llvm-exegesis.cpp | ||
---|---|---|
291 ↗ | (On Diff #268037) | This is strange. I understand the SDM section about IA32_PERF_CAPABILITIES the same way, i.e. we should see 000110B if there are timings, and this should happen only on Skylake or onwards. That said, I don't think that perf_event_mmap_page::capabilities contains the IA32_PERF_CAPABILITIES MSR. Instead, it looks like a struct that is defined and filled by the kernel. Check out the definition/docs of the struct (and the code that fills it). They set individual bit fields (and even the bit field defined in the union does not match the other bits defined in the MSR). After some digging through kernel sources, I see where the kernel keeps the data internally, but I didn't see a way to get it out of the kernel. We might need to do more research on that. |
Only style comments left
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp | ||
---|---|---|
81 | let's use a c++ cast here. | |
83 | Actually I'm thinking you can just pass a StringRef to readOrError | |
llvm/tools/llvm-exegesis/lib/PerfHelper.h | ||
76–77 | Please explain what the arguments are (and switch to StringRef as mentioned above ?) | |
llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp | ||
150 | braces | |
205 | style: no braces | |
llvm/tools/llvm-exegesis/lib/X86/X86Counter.h | ||
9 | This could use a file comment with a link to LBR documentation. |
llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp | ||
---|---|---|
179 | Let's return an error when FunctionBytes.empty(). Right now this will assert later. |
Looks good overall.
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp | ||
---|---|---|
51–52 | [style] Remove the empty line. |
This patch doesn't seem to build for me:
/iusers/ekeane1/workspaces/llvm-project/llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp: In function ‘llvm::Error llvm::exegesis::parseDataBuffer(const char*, size_t, const void*, const void*, llvm::SmallVector<long int, 4>*)’:
/iusers/ekeane1/workspaces/llvm-project/llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp:99:37: error: ‘struct perf_branch_entry’ has no member named ‘cycles’
CycleArray->push_back(Entry.cycles);
I'm on RHEL7, so I have kernel 3.10, so it doesn't have 'cycles'.
According ot this: https://elixir.bootlin.com/linux/v4.3/source/include/uapi/linux/perf_event.h#L963 kernel 4.3 is the first time that 'cycles' appeared in this structure.
If you're not using llvm-exegesis you can disable this with the LLVM_ENABLE_LIBPFM cmake option.
But I'll revert this until @oontvoo can fix llvm/cmake/modules/FindLibpfm.cmake
On X86,