This is an archive of the discontinued LLVM Phabricator instance.

[NFCI][llvm-exegesis] Benchmark: parallelize codegen (5x ... 8x less wallclock)
Needs Review · Public

Authored by lebedev.ri on Dec 18 2022, 9:39 AM.

Details

Summary

(I'd be fine just committing this, but decided to post it anyway, for greater visibility if nothing else.)

We *might* not want to perform codegen for all Configurations × Repetitions
beforehand, since the produced Runnable Configurations may have significant
file sizes (up to 1 MB?), and there are many Runnable Configurations (30k?).
But doing it in batches is fine memory-wise, and is a win, as expected.

We really don't want to smudge the measurements, so those are performed
standalone, without running *anything* else in parallel;
but when not measuring, the codegen can be done in parallel.

Special care is taken to produce snippets in a deterministic order;
the snippets themselves are still randomized, but non-determinism
in their ordering would not be useful on top of that.
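
To make the batching scheme concrete, here is a minimal self-contained sketch
of the idea. This is *not* the patch itself: generate() and measure() are
hypothetical stand-ins for the real per-configuration codegen and measurement
steps, and a real implementation would use a thread pool rather than one
thread per task.

  #include <algorithm>
  #include <cstddef>
  #include <future>
  #include <string>
  #include <vector>

  std::string generate(int Config);      // expensive; safe to parallelize
  void measure(const std::string &Obj);  // timing-sensitive; must run alone

  void run(const std::vector<int> &Configs, std::size_t BatchSize) {
    for (std::size_t Begin = 0; Begin < Configs.size(); Begin += BatchSize) {
      std::size_t End = std::min(Begin + BatchSize, Configs.size());
      // Perform codegen for the whole batch in parallel, collecting the
      // results in submission order, so the output stays deterministic no
      // matter which thread finishes first.
      std::vector<std::string> Objects;
      Objects.reserve(End - Begin);
      {
        std::vector<std::future<std::string>> Futures;
        for (std::size_t I = Begin; I != End; ++I)
          Futures.push_back(
              std::async(std::launch::async, generate, Configs[I]));
        // Join *all* codegen tasks before measuring anything.
        for (std::future<std::string> &F : Futures)
          Objects.push_back(F.get());
      }
      // Measure standalone, with nothing else running in parallel, so the
      // codegen threads cannot smudge the measurements.
      for (const std::string &Obj : Objects)
        measure(Obj);
    }
  }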

And so it becomes almost real-time:

time ./bin/llvm-exegesis --opcode-index=-1 --mode=latency --repetition-mode=duplicate --dump-object-to-disk=0 --benchmarks-file=/tmp/res-new.yaml --measurements-print-progress --max-configs-per-opcode=8192

old:

real    1m33.500s
user    1m29.644s
sys     0m1.762s

new:

real    0m18.191s
user    3m8.253s
sys     0m3.999s

(5.1x)

time ./bin/llvm-exegesis --opcode-index=-1 --mode=uops --repetition-mode=duplicate --dump-object-to-disk=0 --benchmarks-file=/tmp/res-new.yaml --measurements-print-progress

old:

real    1m52.256s
user    1m48.518s
sys     0m1.479s

new:

real    0m13.273s
user    4m14.228s
sys     0m4.903s

(8.5x)

time ./bin/llvm-exegesis --opcode-index=-1 --mode=inverse_throughput --repetition-mode=duplicate --dump-object-to-disk=0 --benchmarks-file=/tmp/res-new.yaml --measurements-print-progress

old:

real    1m58.765s
user    1m53.259s
sys     0m2.937s

new:

real    0m19.586s
user    4m19.133s
sys     0m6.314s

(6x)

See also: https://discourse.llvm.org/t/does-anyone-use-llvm-exegesis-feedback-wanted/67729/

Diff Detail

Event Timeline

lebedev.ri created this revision. Dec 18 2022, 9:39 AM
Herald added a project: Restricted Project. Dec 18 2022, 9:39 AM
Herald added a subscriber: mstojanovic.
lebedev.ri requested review of this revision. Dec 18 2022, 9:39 AM
lebedev.ri edited reviewers, added: gchatelet; removed: gchakrabarti.

Note: my main question here is whether we want any knobs for this.

Does anyone have any thoughts here?
It's fine if nobody is comfortable reviewing the parallelism;
just a general review of the user-facing side is fine.

Add the flags.

To answer the question I know will come up: yes, thread-batch-size is somewhat useful.
The problem is that the final snippet to be measured can vary greatly in size,
depending on the unroll factor. And when you are trying to measure many instructions,
e.g. all 20k of them, and end up with ~40k snippets, if each one takes 1 MB
(worst-case scenario), you already need 40 GB of RAM.
We don't really know beforehand how many bytes any particular snippet will take,
so putting a limit on the amount of memory to be used seems less feasible.
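
To spell out the arithmetic (batch size purely illustrative): with batching, peak memory is instead bounded by roughly

  peak memory ≈ thread-batch-size × worst-case snippet size

so e.g. a batch of 1024 snippets at 1 MB each needs about 1 GB at a time, rather than ~40 GB for materializing all ~40k snippets up front.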

Do we really need the speed boost? My worry is that we are making the code more complex for a negligible advantage. I personally have never felt that waiting for a couple of minutes was really an issue, but you might have other use cases in mind?

TBH I'm not sure it's worth making the code more complex to save a minute on a tool that is supposed to run only once in a while.
That said, I may not fully understand your use case. What's your typical use? Why is it important to make it faster?

...
Are you seriously saying that 10x wallclock improvement is negligible?

The reason is always the same: making iterative development cycles faster.
I would just like to point out that the analysis used to be 50x slower originally.
It was borderline unusable. But sure, one rarely needs it, so perhaps that was fine.
I would like to note that it is still slow[er than needed]. And the linear algebra stuff,
which I'm hoping to revisit, will make it even slower.

This is exactly the same story here. We are spending massive amounts of time
and not doing anything useful with it, when we could be doing something useful,
like, dunno, making more measurements. Why do we want it to be slow?
For me that long runtime has historically been the pain point.

If that was a joke, it wasn't a funny one, I'm afraid.

Additionally, I would like to note a well-known fact: exegesis makes *very* liberal use of randomness.
Personally, I've found that a single measurement run (a single llvm-exegesis invocation) is not quite useful,
and I practically always do several runs and concatenate the YAMLs. That e.g. filters out some of
the "oh, but why does this instruction's schedule suddenly not match the measurements?" So there goes some of that
performance headroom. But sure, everyone has 10 hours to wait while sleep(100); a+=1; sleep(100) finishes...
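
For illustration, that workflow is roughly the following (run count and paths arbitrary; the flags are the same ones used in the summary above):

  ./bin/llvm-exegesis --opcode-index=-1 --mode=latency --benchmarks-file=/tmp/run-1.yaml
  ./bin/llvm-exegesis --opcode-index=-1 --mode=latency --benchmarks-file=/tmp/run-2.yaml
  ./bin/llvm-exegesis --opcode-index=-1 --mode=latency --benchmarks-file=/tmp/run-3.yaml
  cat /tmp/run-*.yaml > /tmp/combined.yaml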

Ping.
Is there some concrete concern? If there isn't, it can't be addressed.

All of this is pretty much entirely idiomatic boilerplate.
I understand that parallelism may be hard to understand the first few times,
but there isn't even any synchronization going on here.

Are you seriously saying that 10x wallclock improvement is negligible?

10x speed improvement is not negligible. I'm simply questioning whether speed matters in this case.

If the speed improvement came at no cost, that would be a no-brainer. But there is a speed/readability tradeoff, which we need to evaluate.

In my personal experience, I did not feel that benchmarking speed was ever an issue, but I do feel that this code is more complex to understand. Therefore to me that tradeoff is negative. If other people feel otherwise I'm happy to reconsider.

Has anyone got a recent profile of llvm-exegesis up to --benchmark-phase=assemble-measured-code?

lebedev.ri added a comment (edited). Jan 16 2023, 7:13 AM

Are you seriously saying that 10x wallclock improvement is negligible?

10x speed improvement is not negligible. I'm simply questioning whether speed matters in this case.

If the speed improvement came at no cost, that would be a no-brainer. But there is a speed/readability tradeoff, which we need to evaluate.

All the changes here are to a single function
that isn't really going to change further anyway.
It's not like this requires changes in many places.

In my personal experience, I did not feel that benchmarking speed was ever an issue, but I do feel that this code is more complex to understand. Therefore to me that tradeoff is negative. If other people feel otherwise I'm happy to reconsider.

I would not bother with this if I didn't find the existing speed to be problematic.
I would not have added the progress meter either, nor fixed the analysis speed.
I'm doing these things because I found them to be sub-par during my usage.

Has anyone got a recent profile of llvm-exegesis up to --benchmark-phase=assemble-measured-code?

For an all-opcode pass,

  • --benchmark-phase=prepare-snippet is instantaneous, taking less than a second
  • --benchmark-phase=prepare-and-assemble-snippet takes maybe 2-5 seconds
  • --benchmark-phase=assemble-measured-code takes minutes.
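
(For reference, each phase can be timed in isolation with an invocation along these lines, varying --benchmark-phase; the remaining flags match the commands from the summary:)

  time ./bin/llvm-exegesis --opcode-index=-1 --mode=latency --repetition-mode=duplicate --dump-object-to-disk=0 --benchmark-phase=prepare-snippet --benchmarks-file=/tmp/res.yaml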

Are you seriously saying that 10x wallclock improvement is negligible?

10x speed improvement is not negligible. I'm simply questioning whether speed matters in this case.

If the speed improvement came at no cost, that would be a no-brainer. But there is a speed/readability tradeoff, which we need to evaluate.

All the changes here are to a single function
that isn't really going to change further anyway.
It's not like this requires changes in many places.

I don't think the number of touched functions is an agreed-upon measure of complexity :) I would even argue that splitting into more functions might make the code less complex.

But again, my point is more about usefulness to users, and right now it looks like the change leaves both reviewers (who happen to be users) questioning the complexity/usefulness ratio. But both these users might also tend to use the tool the same way, so maybe we're not seeing the whole picture - maybe ask for more opinions?

Has anyone got a recent profile of llvm-exegesis up to --benchmark-phase=assemble-measured-code?

For an all-opcode pass,

  • --benchmark-phase=prepare-snippet is instantaneous, taking less than a second
  • --benchmark-phase=prepare-and-assemble-snippet takes maybe 2-5 seconds
  • --benchmark-phase=assemble-measured-code takes minutes.

I meant an actual profile - not just a timing - to see where the cycles are going. My hope is that there will be a series of minor changes we can make instead of going down the multi-threading path.

Has anyone got a recent profile of llvm-exegesis up to --benchmark-phase=assemble-measured-code?

For an all-opcode pass,

  • --benchmark-phase=prepare-snippet is instantaneous, taking less than a second
  • --benchmark-phase=prepare-and-assemble-snippet takes maybe 2-5 seconds
  • --benchmark-phase=assemble-measured-code takes minutes.

I meant an actual profile - not just a timing - to see where the cycles are going. My hope is that there will be a series of minor changes we can make instead of going down the multi-threading path.

The actual codegen is known to be a compile-time hog.
There is practically no way this result can be replicated otherwise.

Are you seriously saying that 10x wallclock improvement is negligible?

10x speed improvement is not negligible. I'm simply questioning whether speed matters in this case.

If the speed improvement came at no cost, that would be a no-brainer. But there is a speed/readability tradeoff, which we need to evaluate.

All the changes here are to a single function
that isn't really going to change further anyway.
It's not like this requires changes in many places.

I don't think the number of touched functions is an agreed-upon measure of complexity :) I would even argue that splitting into more functions might make the code less complex.

But again, my point is more about usefulness to users, and right now it looks like the change leaves both reviewers (who happen to be users) questioning the complexity/usefulness ratio. But both these users might also tend to use the tool the same way, so maybe we're not seeing the whole picture - maybe ask for more opinions?

Right. Well, as a user, I do think the existing performance is not great.
More seriously, I think you would be in a better position to ask users, because I don't know any (other than the ones already here).

Has anyone got a recent profile of llvm-exegesis up to --benchmark-phase=assemble-measured-code?

For an all-opcode pass,

  • --benchmark-phase=prepare-snippet is instantaneous, taking less than a second
  • --benchmark-phase=prepare-and-assemble-snippet takes maybe 2-5 seconds
  • --benchmark-phase=assemble-measured-code takes minutes.

I meant an actual profile - not just a timing - to see where the cycles are going. My hope is that there will be a series of minor changes we can make instead of going down the multi-threading path.

See https://discourse.llvm.org/t/does-anyone-use-llvm-exegesis-feedback-wanted/67729/5?u=lebedevri

gchatelet added inline comments. Jan 23 2023, 2:03 AM
llvm/tools/llvm-exegesis/llvm-exegesis.cpp
249–250

Does this bring anything to the user?
Maybe "The number of threads to use for parallel operations (default = 0 (autodetect))"

256–259

"The batch size for parallel operations as it is not efficient to run one task per thread (default = 0 (autodetect))" ?

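A minimal sketch of how these two flags could be declared via llvm::cl with the suggested wording; "thread-batch-size" is the name mentioned in this review, while "num-threads" is a hypothetical placeholder:

  static cl::opt<unsigned> NumThreads(
      "num-threads",
      cl::desc("The number of threads to use for parallel operations "
               "(default = 0 (autodetect))"),
      cl::init(0));

  static cl::opt<unsigned> ThreadBatchSize(
      "thread-batch-size",
      cl::desc("The batch size for parallel operations, as it is not "
               "efficient to run one task per thread "
               "(default = 0 (autodetect))"),
      cl::init(0));
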
524

I think we can drop the exegesis namespace qualifier here.

532–537

It is not clear to me what you're trying to achieve here.
I would introduce a function and assign the variable once instead of mutating it in place. The function's name may also help make the intent clear.

536

Can we introduce functions to cut down on the nesting level?

Also, I believe functions will largely improve readability.

while (!Configurations.empty()) {
  // setup PerConfigRCs
  computeBatch(PerConfigRCs);
  runBatch(PerConfigRCs);
  PerConfigRCs.clear();
}
lebedev.ri marked 5 inline comments as done.
lebedev.ri edited the summary of this revision. (Show Details)

@gchatelet thank you for taking a look!
Is this better?

llvm/tools/llvm-exegesis/llvm-exegesis.cpp
524

Right. This kind of thing is probably happening in a few other places.

Simplify loops in computeBatch().

While there, extract runOneConfiguration() out of runBatch().

Much better, thx.
I've added a few more comments.

llvm/tools/llvm-exegesis/llvm-exegesis.cpp
408–411

Can you try with structured binding here (and below)?

419

typo

419

typo

419

no whitespace since we're talking about the type

440

remove

444

ditto

464

Can we reverse this condition and return early to save on the nesting level?