This is an archive of the discontinued LLVM Phabricator instance.

[llvm-exegesis] Add benchmark mode that uses LBR for more precise measurements.
ClosedPublic

Authored by oontvoo on Apr 3 2020, 11:29 AM.

Details

Summary

Add benchmark mode that uses LBR for more precise measurements.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
courbet added inline comments.Apr 6 2020, 4:28 AM
llvm/tools/llvm-exegesis/llvm-exegesis.cpp
88

Let's move this to the X86 target, and rename it to x86-lbr-sample-period, with a default of 0. If non-zero, we're going to use LBR. Else, we'll use the non-LBR runner.

And thanks for the contribution.

oontvoo marked 9 inline comments as done.Apr 7 2020, 1:41 PM
oontvoo added inline comments.
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
106

Yes, it would've ... Changed to Counter*

llvm/tools/llvm-exegesis/lib/PerfHelper.cpp
91

Yes - it should be macro guarded. (I mindlessly deleted the guards while moving things around)
Added it back.

oontvoo updated this revision to Diff 255793.Apr 7 2020, 1:41 PM
oontvoo marked 2 inline comments as done.

Addressed review comments:

  • remove the new mode and merge it into latency
  • various refactorings and comments-cleanup per courbet@'s requests
oontvoo updated this revision to Diff 255807.Apr 7 2020, 2:13 PM

Fix typos and rebase

oontvoo updated this revision to Diff 255837.Apr 7 2020, 3:26 PM

Fix clang-tidy warning

courbet added inline comments.Apr 8 2020, 6:57 AM
llvm/docs/CommandGuide/llvm-exegesis.rst
220

Can you expand on what "hiding blocks" means ? e.g. an external reference ?

llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
87

Apart from the PerfEvent, this does not really have anything to do with LBR.
On the one hand, this is not really a full-fledged FunctionExecutor, as it just throws away the CounterName argument and always measures the latency. This would be an argument for moving this to the X86 backend.
On the other hand, this has a lot of target-independent code, so it's a good thing that other targets can use it. The main issue, and the only thing that's really specific to LBR here is the perf event. Letting the target create the perf event as well as the counter does fix this.

So what we can do:

  • change Target::createCounter to take a counter name rather than a PerfEvent. The default implementation creates the PerfEvent, validates it and wraps it in a Counter. The X86 implementation tests if we're asking for the cycles counter in LBR mode, and creates the relevant PerfEvent and LBRCounter. It falls back to the default implementation else.
  • We stop ignoring the counter name argument in LbrFunctionExecutorImpl::runAndMeasure, and just propagate it to the ExegesisTarget.
  • We rename LbrFunctionExecutorImpl into WorkerThreadFunctionExecutorImpl (and FunctionExecutorImpl could be renamed to SameThreadFunctionExecutor, and everything becomes target agnostic.
195

Again, this is too specific to X86. I think it's one of these cases where is actually makes more sense for the generic code to know nothing about LBR. What about:

  • Moving LbrSamplePeriod from llvm-exegesis to X86/Target.cpp
  • Making createFunctionExecutor() a virtual member of the target. The default implementation returns a FunctionExecutorImpl.
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.h
49

The LLVM style is FIXME

llvm/tools/llvm-exegesis/lib/X86/Target.cpp
45

do we really need all these ?

760

[nit] s/PFM package available../HAVE_LIBPFM/ the package might be available but specifically disabled

llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp
16 ↗(On Diff #255837)

Does this work under windows ? If not, please use one of the LLVM wrappers.

29 ↗(On Diff #255837)

internal functions use a static keyword instead of anonymous namespace in the LLVM style guide.

34 ↗(On Diff #255837)

s/time out/timeout

38 ↗(On Diff #255837)

static

52 ↗(On Diff #255837)

This would benefit from a pointer to documentation (or argued speculation about) whether rewriting code inside a loop for another thread is valid and how it behaves w.r.t. caches and DSB.

56 ↗(On Diff #255837)

static

83 ↗(On Diff #255837)

static

125 ↗(On Diff #255837)

you can remove this now

197 ↗(On Diff #255837)

FIXME. Also, please give more explanations so that somebody else can look at that too ;)

llvm/tools/llvm-exegesis/lib/X86/X86Counter.h
1 ↗(On Diff #255837)

This file, and the cpp one, are missing the LLVM headers.

oontvoo updated this revision to Diff 256162.Apr 8 2020, 7:44 PM
oontvoo marked 9 inline comments as done.
  • Make internal helpers static functions (instead of anonymous ns members)
  • Reword confusing comments
  • Add LLVM headers
courbet added inline comments.Apr 8 2020, 11:40 PM
llvm/docs/CommandGuide/llvm-exegesis.rst
220

typo: ..

llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
104

std::unique_ptr + no dtor ?

llvm/tools/llvm-exegesis/lib/BenchmarkRunner.h
49

FIXME and no name :)

ondrasej added inline comments.Apr 9 2020, 7:30 AM
llvm/docs/CommandGuide/llvm-exegesis.rst
209

The LBR on Haswell has only the addresses of the branches. To get the timing, Skylake is the minimal requirement.

217

I'd add a mention that this applies to the LBR ratency mode.

llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
183

As noted above - Haswell and Broadwell have the LBR, but it does not report the number of cycles between the branches.

oontvoo updated this revision to Diff 256476.Apr 9 2020, 6:58 PM
oontvoo marked 12 inline comments as done.

Addressed review comments.

oontvoo updated this revision to Diff 256477.Apr 9 2020, 7:02 PM
oontvoo marked 5 inline comments as done.

Add more comment

Thanks for the thorough reviews! :-)

llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
87

Done - instead of moving the executor out, I've just let the Target decides whether to use WorkerThread executor or otherwise.

llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp
16 ↗(On Diff #255837)

Per offline discussion, we'll ifdef this out for non-linux and will add support for windows in follow-up CLs

52 ↗(On Diff #255837)

I meant to rewrite this hack to use a condition var for stopping the loop from another thread, per Ondrej's suggestion .
I just haven't got to it yet :-) ... (would need more changes in the assembler.cpp )

So I'll add a FIXME for next CLs

197 ↗(On Diff #255837)

Per offline discussions, this requires re-working the read() interface because we need to return more than one value.
(Specifically, the lbr-entries offer a list of cycles-latencies for a bunch of consec branches ... we don't want to just ignore the older branches and only look at the most recent)

But because this patch is already quite big , we'll keep this as TODO and follow up.

courbet added inline comments.Apr 10 2020, 6:07 AM
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
87

That forces passing the RunArg to the target, which has no reason to know about it.

What I'm suggesting is the following:

  • move LbrSamplePeriod to X86/Target.cpp. This whole SamplePeriod thing, and the existence of the option is really just a X86 detail, the target-independent code has no need to know about any of this.
  • Get rid of SamplePeriod, in the target-independant code. RunArg can go too.
  • createCounter does not need Description, or SamplePeriod. The X86 target handles it internally.
llvm/tools/llvm-exegesis/lib/X86/Target.cpp
781

This should really onely be:

Expected<std::unique_ptr<pfm::Counter>>
createCounter(const char *CounterName) {
  if (CounterName == kCyclesCounter && LbrSamplingPeriod > 0) {
    return std::make_unique<X86LbrCounter>(PerfEvent("LBR", LbrSamplingPeriod));
  }
  return ExegesisTarget::createCounter(CounterName);
}
783

This will fail to compile under windows because X86LbrCounter does not exist there.

llvm/tools/llvm-exegesis/llvm-exegesis.cpp
342

This compile-time guarded runtime check can now go, because the x86 specific concepts are handled in the x86 target.

oontvoo updated this revision to Diff 256682.Apr 10 2020, 3:21 PM
oontvoo marked 2 inline comments as done.

Addressed review comments:

  • remove no-longer needed RunArg
  • move target-specific to X86/Target
oontvoo updated this revision to Diff 258122.Apr 16 2020, 12:06 PM
oontvoo marked 3 inline comments as done.

Add check for the LBR format to ensure it contains cycles.

oontvoo updated this revision to Diff 260000.Apr 24 2020, 2:41 PM

Update perf_event_attr to use BR_INST_RETIRED.NEAR_TAKEN per eranian@'s recommendation.

oontvoo updated this revision to Diff 260014.Apr 24 2020, 3:51 PM

ran clang-format

oontvoo updated this revision to Diff 260733.Apr 28 2020, 12:41 PM

revert unrelated change

oontvoo updated this revision to Diff 260790.Apr 28 2020, 4:10 PM

Fix dup options

oontvoo updated this revision to Diff 261015.Apr 29 2020, 1:32 PM

Add new cpp file to makefile

oontvoo updated this revision to Diff 262921.May 8 2020, 11:33 AM

Update perf-even-attr for raw-encoding

oontvoo updated this revision to Diff 265310.May 20 2020, 11:18 AM
oontvoo updated this revision to Diff 265311.May 20 2020, 11:38 AM
oontvoo updated this revision to Diff 265315.May 20 2020, 11:53 AM

Updated patch to cover what we discussed in the sync last week and this week, specifically:

  • Remove the worker-thread executor (not needed. It's fine to use the same thread)
  • Make PerfEvent virtual so the X86Counter can create its own PerfEvent with LBR-specific attributes.

Remaining questions:

  • Counter should be able to return more than one readings.
  • We should validate that repetion-mode=loop for LBR
courbet added inline comments.May 25 2020, 11:31 PM
llvm/docs/CommandGuide/llvm-exegesis.rst
207

On X86,

llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
13

you're no longer using <thread> and <mutex>.

16

is this needed ?

69

s/auto/pfm::Counter/

llvm/tools/llvm-exegesis/lib/PerfHelper.h
45

why is the dtor virtual ? I don't see any virtual methods ?

llvm/tools/llvm-exegesis/lib/Target.h
69

Can you submit this change independently and rebase ? Thanks !

llvm/tools/llvm-exegesis/lib/X86/Target.cpp
755

&& LbrSamplingPeriod > 0 ?

llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp
64 ↗(On Diff #265315)

fills

183 ↗(On Diff #265315)

[style]

if (PollResult == -1)
  return llvm::make_error<llvm::StringError>("Cannot poll LBR perf event.", llvm::errc::io_error);
if (PollResult != 0) 
  continue;
...
184 ↗(On Diff #265315)

how often does that happen ? Do we want that much logging ?

courbet added inline comments.May 25 2020, 11:31 PM
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
66

[style] no braces for 1-line conditional bodies

llvm/tools/llvm-exegesis/lib/Target.cpp
34

[style] no braces for 1-line conditional bodies

llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp
87 ↗(On Diff #265315)

[style] no braces for 1-line conditional bodies

130 ↗(On Diff #265315)

[style] no braces for 1-line conditional bodies

143 ↗(On Diff #265315)

[style] no braces for 1-line conditional bodies

205 ↗(On Diff #265315)

[style] no braces for 1-line conditional bodies

215 ↗(On Diff #265315)

ditto

222 ↗(On Diff #265315)

ditto

oontvoo updated this revision to Diff 266422.May 26 2020, 10:14 PM
oontvoo marked 17 inline comments as done.

Addressed review comments

oontvoo marked an inline comment as done.May 26 2020, 10:16 PM
oontvoo added inline comments.
llvm/tools/llvm-exegesis/lib/Target.h
69
llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp
184 ↗(On Diff #265315)

It could happen 100% of the time if the code happens to have no branches.
With "proper" input, it happens less frequently. I don't really have a sense of how often ... it's probably be fine to change this to llvm::debug() or something?

courbet added inline comments.May 26 2020, 11:11 PM
llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp
184 ↗(On Diff #265315)

I'm not worried about the no-branch case as we can avoid that. We might as well remove it, because we don't really care if it fanally succeeded, and if it did not we can print an error there.

183 ↗(On Diff #266422)

you don't need the else after the return.

189 ↗(On Diff #266422)

you don't need the else here after the continue.

oontvoo updated this revision to Diff 266955.May 28 2020, 11:03 AM

Rebase to reduce diff

oontvoo updated this revision to Diff 266966.May 28 2020, 11:32 AM
oontvoo marked 5 inline comments as done.

Simplified re-try

oontvoo updated this revision to Diff 266973.May 28 2020, 11:39 AM

OK, I think this is in good shape now, but we still need a test for it. The only issue is that not all buildbots support LBR, so we need a mechanism for skipping the test when not available. You can create a subdir llvm-project/llvm/test/tools/llvm-exegesis/X86/lbr with a custom lit similar to llvm-project/llvm/test/tools/llvm-exegesis/X86/lit.local.cfg to check lbr support before running the test.

oontvoo updated this revision to Diff 268012.Jun 2 2020, 3:48 PM

Add test and option to check for expected CPU.

I can't seem to find a way to get the host CPU version (much less name or pmu capabilities) from the lit config, so I had to add the check to llvm-exegesis.

oontvoo updated this revision to Diff 268037.Jun 2 2020, 6:54 PM
This comment was removed by oontvoo.
courbet added inline comments.Jun 2 2020, 11:36 PM
llvm/tools/llvm-exegesis/llvm-exegesis.cpp
292

This is a bit brittle, because we could imagine the name of some unrelated CPUs being substrings of others. What about having a repeated option --allowed-host-cpu=skylake --allowed-host-cpu=skylake-avx512 --allowed-host-cpu=whateverlake, and check that the exact value is one of these ?

ondrasej added inline comments.Jun 3 2020, 10:09 AM
llvm/docs/CommandGuide/llvm-exegesis.rst
207

This is more of a philosophical comment: IIRC, the LBR records the times when the branch instructions are retired. With correct branch prediction, the instructions in the measured block would start executing speculatively before the branch is retired, and so the number we get is probably closer to the throughput.

221

I'm wondering how this applies to this use case. The issue with basic blocks skipped might be when sampling all basic blocks in a larger application over a longer period of time, but this is not our case.
In our case, we're interested in a single basic block, and run it in a tight loop for relatively long, so that the LBR sampling catches this exact basic block.

llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp
87 ↗(On Diff #268037)

I think we also need to filter by the addresses of the branches (or store them) to make sure that we take the cycles only for the measured basic block, not for the other basic blocks surrounding it.

llvm/tools/llvm-exegesis/llvm-exegesis.cpp
292

Ideally, this should be defined in terms of CPU features (e.g. CPUID bits). Even better - each target should know which counters it supports, based on its platform-specific feature discovery mechanism. I understand that this would be a huge change for this CL, but we should at least have a FIXME here.

oontvoo added inline comments.Jun 3 2020, 11:42 AM
llvm/tools/llvm-exegesis/llvm-exegesis.cpp
292

Actually, I had a question here that I was gonna ask in an email But here goes.

We know that the LBR formats could be queried from the perf-cap MSR
Specifically, we want: MSR IA32_PERF_CAPABILITIES[5:0]" == 000110B (bit 59...63 is not relevant)

If I'm not mistaken, perf_event_mmap_page::capabilities should(?) give us that.
Except, when I run this on both Broadwell and Skylake, the capabilities field has value of 30 for both platforms. (It shouldn't be). Of course, the difference here is that the cycle entries are all zeroes on Broadwell .
I haven't looked in details, so I don't know if this is some implementation detail of the libpfm or if we're just mis-interpreting the SDM here.

Thoughts?

ondrasej added inline comments.Jun 4 2020, 3:28 AM
llvm/tools/llvm-exegesis/llvm-exegesis.cpp
292

This is strange.

I understand the SDM section about IA32_PERF_CAPABILITIES the same way, i.e. we should see 000110B if there are timings, and this should happen only on Skylake or onwards.

That said, I don't think that perf_event_mmap_page::capabilities contains the IA32_PERF_CAPABILITIES MSR. Instead, it looks like a struct that is defined and filled by the kernel. Check out the definition/docs of the struct (and the code that fills it). They set individual bit fields (and even the bit field defined in the union does not match the other bits defined in the MSR).

After some digging through kernel sources, I see where the kernel keeps the data internally, but I didn't see a way to get it out of the kernel. We might need to do more research on that.

oontvoo updated this revision to Diff 271558.Jun 17 2020, 8:09 PM

Filter and keep only records for the benchmarked codeblock

oontvoo updated this revision to Diff 271800.Jun 18 2020, 11:36 AM
oontvoo marked an inline comment as done.

Add option to not filter the records too

oontvoo updated this revision to Diff 273877.Jun 26 2020, 6:43 PM
oontvoo marked 5 inline comments as done.

Updated diff

oontvoo updated this revision to Diff 273878.Jun 26 2020, 7:22 PM

clang-tidy

Only style comments left

llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
78

let's use a c++ cast here.

80

Actually I'm thinking you can just pass a StringRef to readOrError

llvm/tools/llvm-exegesis/lib/PerfHelper.h
84–85

Please explain what the arguments are (and switch to StringRef as mentioned above ?)

llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp
149 ↗(On Diff #273878)

braces

204 ↗(On Diff #273878)

style: no braces

llvm/tools/llvm-exegesis/lib/X86/X86Counter.h
8 ↗(On Diff #273878)

This could use a file comment with a link to LBR documentation.

oontvoo updated this revision to Diff 276122.Jul 7 2020, 10:17 AM
oontvoo marked 6 inline comments as done.

updated diff

courbet added inline comments.Jul 8 2020, 1:07 AM
llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp
178 ↗(On Diff #276122)

Let's return an error when FunctionBytes.empty(). Right now this will assert later.

oontvoo updated this revision to Diff 276464.Jul 8 2020, 9:27 AM
oontvoo marked an inline comment as done.

Return error for empty function

Looks good overall.

llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
50–51

[style] Remove the empty line.

courbet accepted this revision.Jul 8 2020, 10:59 PM
This revision is now accepted and ready to land.Jul 8 2020, 10:59 PM
oontvoo updated this revision to Diff 276830.Jul 9 2020, 1:46 PM
oontvoo marked an inline comment as done.

Removed empty line

oontvoo updated this revision to Diff 276842.Jul 9 2020, 2:40 PM

Fixed warning [-Wmissing-field-initializers] in code

oontvoo closed this revision.Jul 16 2020, 9:19 AM

This patch doesn't seem to build for me:
/iusers/ekeane1/workspaces/llvm-project/llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp: In function ‘llvm::Error llvm::exegesis::parseDataBuffer(const char*, size_t, const void*, const void*, llvm::SmallVector<long int, 4>*)’:
/iusers/ekeane1/workspaces/llvm-project/llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp:99:37: error: ‘struct perf_branch_entry’ has no member named ‘cycles’

CycleArray->push_back(Entry.cycles);

I'm on RHEL7, so I have kernel 3.10, so it doesn't have 'cycles'.

According ot this: https://elixir.bootlin.com/linux/v4.3/source/include/uapi/linux/perf_event.h#L963 kernel 4.3 is the first time that 'cycles' appeared in this structure.

If you're not using llvm-exegesis you can disable this with the LLVM_ENABLE_LIBPFM cmake option.
But I'll revert this until @oontvoo can fix llvm/cmake/modules/FindLibpfm.cmake

oontvoo reopened this revision.Jul 17 2020, 3:54 PM
This revision is now accepted and ready to land.Jul 17 2020, 3:54 PM
oontvoo updated this revision to Diff 280021.Jul 22 2020, 8:47 PM

Check for existence of cycles before enabling LBR mode.