Page MenuHomePhabricator

[llvm-exegesis] Add benchmark mode that uses LBR for more precise measurements.
Needs ReviewPublic

Authored by oontvoo on Apr 3 2020, 11:29 AM.



Add benchmark mode that uses LBR for more precise measurements.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

clang-format diff

oontvoo updated this revision to Diff 254993.Apr 3 2020, 8:21 PM

more conflict and fix accidental use of internal include paths

This isn't a new mode, it's still for measuring latency.

This is a good point. In the end what matters is that we're measuring the same thing, and e.g. analysis should not care how we're measuring this value.


Does'n this slice the Counter object ?


Should this not be guarded by HAVE_LIBPFM ? Can you please try to compile & run without libpfm support to validate ?


This shold move to the X86 exegesis target. The function is virtual, so the default implementation can return false.

And with Roman's suggestion, you don't need it at all :)


With Roman's suggestion, you can just remove this and use createLatencyBenchmarkRunner for this.


Please move this to a separate file within X86 the target.

88 ↗(On Diff #254993)

Let's move this to the X86 target, and rename it to x86-lbr-sample-period, with a default of 0. If non-zero, we're going to use LBR. Else, we'll use the non-LBR runner.

91 ↗(On Diff #254993)

Please remove references to GWP.

And thanks for the contribution.

oontvoo marked 9 inline comments as done.Apr 7 2020, 1:41 PM
oontvoo added inline comments.

Yes, it would've ... Changed to Counter*


Yes - it should be macro guarded. (I mindlessly deleted the guards while moving things around)
Added it back.

oontvoo updated this revision to Diff 255793.Apr 7 2020, 1:41 PM
oontvoo marked 2 inline comments as done.

Addressed review comments:

  • remove the new mode and merge it into latency
  • various refactorings and comments-cleanup per courbet@'s requests
oontvoo updated this revision to Diff 255807.Apr 7 2020, 2:13 PM

Fix typos and rebase

oontvoo updated this revision to Diff 255837.Apr 7 2020, 3:26 PM

Fix clang-tidy warning

courbet added inline comments.Apr 8 2020, 6:57 AM

Can you expand on what "hiding blocks" means ? e.g. an external reference ?


Apart from the PerfEvent, this does not really have anything to do with LBR.
On the one hand, this is not really a full-fledged FunctionExecutor, as it just throws away the CounterName argument and always measures the latency. This would be an argument for moving this to the X86 backend.
On the other hand, this has a lot of target-independent code, so it's a good thing that other targets can use it. The main issue, and the only thing that's really specific to LBR here is the perf event. Letting the target create the perf event as well as the counter does fix this.

So what we can do:

  • change Target::createCounter to take a counter name rather than a PerfEvent. The default implementation creates the PerfEvent, validates it and wraps it in a Counter. The X86 implementation tests if we're asking for the cycles counter in LBR mode, and creates the relevant PerfEvent and LBRCounter. It falls back to the default implementation else.
  • We stop ignoring the counter name argument in LbrFunctionExecutorImpl::runAndMeasure, and just propagate it to the ExegesisTarget.
  • We rename LbrFunctionExecutorImpl into WorkerThreadFunctionExecutorImpl (and FunctionExecutorImpl could be renamed to SameThreadFunctionExecutor, and everything becomes target agnostic.

Again, this is too specific to X86. I think it's one of these cases where is actually makes more sense for the generic code to know nothing about LBR. What about:

  • Moving LbrSamplePeriod from llvm-exegesis to X86/Target.cpp
  • Making createFunctionExecutor() a virtual member of the target. The default implementation returns a FunctionExecutorImpl.

The LLVM style is FIXME


do we really need all these ?


[nit] s/PFM package available../HAVE_LIBPFM/ the package might be available but specifically disabled


Does this work under windows ? If not, please use one of the LLVM wrappers.


internal functions use a static keyword instead of anonymous namespace in the LLVM style guide.


s/time out/timeout




This would benefit from a pointer to documentation (or argued speculation about) whether rewriting code inside a loop for another thread is valid and how it behaves w.r.t. caches and DSB.






you can remove this now


FIXME. Also, please give more explanations so that somebody else can look at that too ;)


This file, and the cpp one, are missing the LLVM headers.

oontvoo updated this revision to Diff 256162.Apr 8 2020, 7:44 PM
oontvoo marked 9 inline comments as done.
  • Make internal helpers static functions (instead of anonymous ns members)
  • Reword confusing comments
  • Add LLVM headers
courbet added inline comments.Apr 8 2020, 11:40 PM

typo: ..


std::unique_ptr + no dtor ?


FIXME and no name :)

ondrasej added inline comments.Apr 9 2020, 7:30 AM

The LBR on Haswell has only the addresses of the branches. To get the timing, Skylake is the minimal requirement.


I'd add a mention that this applies to the LBR ratency mode.


As noted above - Haswell and Broadwell have the LBR, but it does not report the number of cycles between the branches.

oontvoo updated this revision to Diff 256476.Apr 9 2020, 6:58 PM
oontvoo marked 12 inline comments as done.

Addressed review comments.

oontvoo updated this revision to Diff 256477.Apr 9 2020, 7:02 PM
oontvoo marked 5 inline comments as done.

Add more comment

Thanks for the thorough reviews! :-)


Done - instead of moving the executor out, I've just let the Target decides whether to use WorkerThread executor or otherwise.


Per offline discussion, we'll ifdef this out for non-linux and will add support for windows in follow-up CLs


I meant to rewrite this hack to use a condition var for stopping the loop from another thread, per Ondrej's suggestion .
I just haven't got to it yet :-) ... (would need more changes in the assembler.cpp )

So I'll add a FIXME for next CLs


Per offline discussions, this requires re-working the read() interface because we need to return more than one value.
(Specifically, the lbr-entries offer a list of cycles-latencies for a bunch of consec branches ... we don't want to just ignore the older branches and only look at the most recent)

But because this patch is already quite big , we'll keep this as TODO and follow up.

courbet added inline comments.Apr 10 2020, 6:07 AM

That forces passing the RunArg to the target, which has no reason to know about it.

What I'm suggesting is the following:

  • move LbrSamplePeriod to X86/Target.cpp. This whole SamplePeriod thing, and the existence of the option is really just a X86 detail, the target-independent code has no need to know about any of this.
  • Get rid of SamplePeriod, in the target-independant code. RunArg can go too.
  • createCounter does not need Description, or SamplePeriod. The X86 target handles it internally.

This should really onely be:

createCounter(const char *CounterName) {
  if (CounterName == kCyclesCounter && LbrSamplingPeriod > 0) {
    return std::make_unique<X86LbrCounter>(PerfEvent("LBR", LbrSamplingPeriod));
  return ExegesisTarget::createCounter(CounterName);

This will fail to compile under windows because X86LbrCounter does not exist there.

348 ↗(On Diff #256477)

This compile-time guarded runtime check can now go, because the x86 specific concepts are handled in the x86 target.

oontvoo updated this revision to Diff 256682.Apr 10 2020, 3:21 PM
oontvoo marked 2 inline comments as done.

Addressed review comments:

  • remove no-longer needed RunArg
  • move target-specific to X86/Target
oontvoo updated this revision to Diff 258122.Apr 16 2020, 12:06 PM
oontvoo marked 3 inline comments as done.

Add check for the LBR format to ensure it contains cycles.

oontvoo updated this revision to Diff 260000.Apr 24 2020, 2:41 PM

Update perf_event_attr to use BR_INST_RETIRED.NEAR_TAKEN per eranian@'s recommendation.

oontvoo updated this revision to Diff 260014.Apr 24 2020, 3:51 PM

ran clang-format

oontvoo updated this revision to Diff 260733.Tue, Apr 28, 12:41 PM

revert unrelated change

oontvoo updated this revision to Diff 260790.Tue, Apr 28, 4:10 PM

Fix dup options

oontvoo updated this revision to Diff 261015.Wed, Apr 29, 1:32 PM

Add new cpp file to makefile

oontvoo updated this revision to Diff 262921.Fri, May 8, 11:33 AM

Update perf-even-attr for raw-encoding

oontvoo updated this revision to Diff 265310.Wed, May 20, 11:18 AM
oontvoo updated this revision to Diff 265311.Wed, May 20, 11:38 AM
oontvoo updated this revision to Diff 265315.Wed, May 20, 11:53 AM

Updated patch to cover what we discussed in the sync last week and this week, specifically:

  • Remove the worker-thread executor (not needed. It's fine to use the same thread)
  • Make PerfEvent virtual so the X86Counter can create its own PerfEvent with LBR-specific attributes.

Remaining questions:

  • Counter should be able to return more than one readings.
  • We should validate that repetion-mode=loop for LBR