This is an archive of the discontinued LLVM Phabricator instance.

[llvm-profdata] Speed up merging by using a thread pool
ClosedPublic

Authored by vsk on Jul 16 2016, 10:41 AM.

Details

Summary

Add a "-j" option to llvm-profdata to control the number of threads
used. Auto-detect NumThreads when it isn't specified, and avoid spawning
threads when they wouldn't be beneficial.

I tested this patch using a raw profile produced by clang (147MB). Here is the
time taken to merge 4 copies together on my laptop:

buildusersystemcputotal
No thread pool112.87s5.92s97%2:01.08
With 2 threads134.99s26.54s164%1:33.31

Diff Detail

Repository
rL LLVM

Event Timeline

vsk updated this revision to Diff 64229.Jul 16 2016, 10:41 AM
vsk retitled this revision from to [llvm-profdata] Speed up merging by using a thread pool.
vsk updated this object.
vsk added reviewers: davidxl, silvas.
vsk updated this object.
vsk added a subscriber: llvm-commits.
silvas edited edge metadata.Jul 16 2016, 3:32 PM

The way you arrived at sqrt(N) seems sketchy to me.
Assuming infinite parallelism, the first step performs N/2 merges and then you recurse on the N/2 merged profiles:

numMerges(N) = N/2 + numMerges(N/2)
numSerialMerges(N) = 1 + numSerialMerges(N/2)

Which must work out to something like log(N) serial merges. I.e. a binary reduction.

davidxl edited edge metadata.Jul 17 2016, 9:56 AM

There isn't any reduction of the number of merges compared with serial case. Say there are T threads, the total number of merges is

T * f(N/T) + T - 1, so f(N) is still N-1.

I suspect a much larger speed up can be obtained by unifying symbol table creation across profiles (that are generated from the same binary -- as their symtab is identical). I was planning doing that but did not find time. If you can help with that, that will be a great win.

David

tools/llvm-profdata/llvm-profdata.cpp
138 ↗(On Diff #64229)

This does not look right. Just make it a controllable option (default with 2 or 4 etc).

143 ↗(On Diff #64229)

Move struct definition out of the function.

161 ↗(On Diff #64229)

Make this a regular function and move it out of line.

vsk added a comment.Jul 17 2016, 1:46 PM

The way you arrived at sqrt(N) seems sketchy to me.

I didn't consider the option of splitting the input by twos. You're right that it's possible to do better than T - 1 + (N/T) serial merges.

Assuming infinite parallelism, the first step performs N/2 merges and then you recurse on the N/2 merged profiles [...]. Which must work out to something like log(N) serial merges.

In practice (dropping the assumption of infinite parallelism), I don't think we can get to log(N) serial merges. You have to at least visit each input once (using any one of T threads), so N/T serial operations looks like the lower bound.

That said, your suggested approach seems better and I'll take a shot at it.

There isn't any reduction of the number of merges compared with serial case. Say there are T threads, the total number of merges is: T * f(N/T) + T - 1, so f(N) is still N-1.

No disagreement here. We will count the same (or higher) number of merges in total when using T threads. It's strictly meant as a speed improvement.

I suspect a much larger speed up can be obtained by unifying symbol table creation across profiles (that are generated from the same binary -- as their symtab is identical). I was planning doing that but did not find time. If you can help with that, that will be a great win.

That sounds like a great idea. Did you have in mind some sort of cache of InstrProfSymtabs which can optionally be passed to the raw profile reader? I can investigate that after this patch is done.

vsk updated this revision to Diff 64330.Jul 18 2016, 8:37 AM
vsk updated this object.
vsk edited edge metadata.
  • Address Sean and David's review comments.
  • Add tests for {even, odd} number of threads x {even, odd} number of inputs.
vsk marked 3 inline comments as done.Jul 18 2016, 8:38 AM

Mark some inline comments as done.

vsk updated this revision to Diff 64333.Jul 18 2016, 8:43 AM
  • Remove stray semi-colon.
davidxl accepted this revision.Jul 18 2016, 2:29 PM
davidxl edited edge metadata.

lgtm

Probably also add a test as a follow up to test option -num-threads=1 and some other non-default value.

tools/llvm-profdata/llvm-profdata.cpp
173 ↗(On Diff #64333)

--> Merge Src context into Dst.

This revision is now accepted and ready to land.Jul 18 2016, 2:29 PM
silvas accepted this revision.Jul 18 2016, 2:59 PM
silvas edited edge metadata.

LGTM.

vsk marked an inline comment as done.Jul 18 2016, 3:04 PM

lgtm

Probably also add a test as a follow up to test option -num-threads=1 and some other non-default value.

Thanks. I'll add the test.

This revision was automatically updated to reflect the committed changes.