[llvm-exegesis] Loop unrolling for loop snippet repetitor mode
Closed, Public

Authored by lebedev.ri on May 14 2021, 12:08 PM.

Details

Summary

I really needed this, like, factually, yesterday.

Consider the following example:

$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=duplicate
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-4a7e50.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config:          ''
  register_initial_values: []
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 0.31025, per_snippet_value: 0.31025 }
error:           ''
info:            ''
assembled_snippet: C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C3
...

What does it tell us?
So wait, it can only execute ~3 x86 AVX YMM PXOR zero-idioms per cycle?
That doesn't seem right. That's even less than there are pipes supporting this type of op.
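The "~3 per cycle" figure is simply the reciprocal of the reported inverse throughput. A minimal sketch of that conversion, assuming the reported value is in cycles per instruction (the helper name `instructions_per_cycle` is made up for illustration and is not part of llvm-exegesis):

```python
def instructions_per_cycle(inverse_throughput: float) -> float:
    """Convert cycles-per-instruction into instructions-per-cycle."""
    if inverse_throughput <= 0:
        raise ValueError("inverse throughput must be positive")
    return 1.0 / inverse_throughput


# Value taken from the --repetition-mode=duplicate run above.
duplicate_ipc = instructions_per_cycle(0.31025)
print(round(duplicate_ipc, 2))
```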

Now, second example:

$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-2418b5.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config:          ''
  register_initial_values: []
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 1.00011, per_snippet_value: 1.00011 }
error:           ''
info:            ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...

Now that's just worse. Due to the looping, the throughput completely plummeted,
and now we can only do a single instruction/cycle!?
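To see what the loop actually looks like, the assembled_snippet bytes from the loop run can be split by hand. The sketch below is not a disassembler: it only recognizes the five encodings that occur in this particular snippet, and the table `KNOWN_ENCODINGS` and the function `split_snippet` are illustrative names, not llvm-exegesis APIs. Use objdump, as the tool suggests, for anything else.

```python
SNIPPET_HEX = "49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3"

# (leading bytes, total instruction length, description).
# Only the encodings present in this one snippet are listed.
KNOWN_ENCODINGS = [
    (bytes.fromhex("49B8"), 10, "movabs r8, imm64"),
    (bytes.fromhex("C5FDEFC0"), 4, "vpxor ymm0, ymm0, ymm0"),
    (bytes.fromhex("4983C0"), 4, "add r8, imm8"),
    (bytes.fromhex("75"), 2, "jne rel8"),
    (bytes.fromhex("C3"), 1, "ret"),
]


def split_snippet(hex_string):
    """Split the hex dump into (offset, description, raw bytes) tuples."""
    code = bytes.fromhex(hex_string)
    decoded = []
    offset = 0
    while offset < len(code):
        for prefix, length, text in KNOWN_ENCODINGS:
            if code.startswith(prefix, offset):
                decoded.append((offset, text, code[offset:offset + length]))
                offset += length
                break
        else:
            raise ValueError(f"unknown encoding at offset {offset}")
    return decoded


decoded = split_snippet(SNIPPET_HEX)
for offset, text, raw in decoded:
    print(f"{offset:2d}: {raw.hex().upper():<20} {text}")

# Loop counter loaded by the movabs (little-endian 64-bit immediate).
loop_counter = int.from_bytes(decoded[0][2][2:], "little")

# jne target = offset of the next instruction + signed 8-bit displacement.
jne_offset, _, jne_raw = decoded[4]
jump_target = jne_offset + len(jne_raw) + int.from_bytes(
    jne_raw[1:], "little", signed=True
)
```

In the printed snippet, the jne jumps back to the first vpxor, so every two vpxor instructions are followed by a counter decrement and a conditional branch.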

That's not great.
And final example:

$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop --loop-body-size=1000
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-c402e2.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config:          ''
  register_initial_values: []
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 0.167087, per_snippet_value: 0.167087 }
error:           ''
info:            ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...

So if we merge the previous two approaches, i.e. duplicate this single-instruction snippet 1000x
(loop-body-size / instruction count in snippet) and run a loop with 1000 iterations
over that duplicated/unrolled snippet, the measured throughput goes through the roof,
up to ~5.98 instructions/cycle (1/0.167087), which finally tells us that this idiom is zero-cycle!
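The arithmetic behind "1000x duplication, 1000 iterations" can be sketched as follows. This is a model of the description above, not the patch's code: the function name `plan_loop` is hypothetical, and rounding up with ceil is an assumption about how non-divisible sizes would be handled.

```python
import math


def plan_loop(num_repetitions: int, loop_body_size: int, snippet_size: int):
    """Return (copies of the snippet per loop body, loop iterations).

    Assumption: both quotients are rounded up, so the loop body is at
    least loop_body_size instructions and at least num_repetitions
    instructions execute in total.
    """
    copies = max(1, math.ceil(loop_body_size / snippet_size))
    body_instructions = copies * snippet_size
    iterations = math.ceil(num_repetitions / body_instructions)
    return copies, iterations


# Values from the third run: --num-repetitions=1000000 --loop-body-size=1000,
# with a snippet that holds a single instruction.
copies, iterations = plan_loop(1_000_000, 1000, 1)
print(copies, iterations)
```

With these inputs the loop overhead (one add and one jne per iteration) is amortized over 1000 snippet instructions instead of two.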

Event Timeline

lebedev.ri created this revision. May 14 2021, 12:08 PM
lebedev.ri requested review of this revision. May 14 2021, 12:08 PM

Thinking about it more, I wonder if this really should be an unroll factor,
or, much like how the repetition count works, maybe this should specify
the desired loop body size?

lebedev.ri edited the summary of this revision.
lebedev.ri edited the summary of this revision.

Proofread the docs some more.

Cool, thanks for the change. I like the approach, only have minor comments.

llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
171–172

[style] This is no longer a constant, so you can remove the k prefix.

172

ditto

llvm/tools/llvm-exegesis/lib/SnippetRepetitor.h
43

Adding the LoopBodySize here sort of breaks the SnippetRepetitor abstraction. I think LoopBodySize should be a member of LoopSnippetRepetitor, initialized in the constructor.

@courbet thank you for taking a look!
Partially address review notes.

@courbet ping
Argh, I again forgot to submit the inline comment :(

llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
172

So if I pass LoopBodySize into the constructor of LoopSnippetRepetitor,
how can I adjust it here then?

courbet accepted this revision. May 24 2021, 11:35 PM
courbet added inline comments.
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
172

Right, I don't have a good suggestion... Let's keep it like this.

This revision is now accepted and ready to land. May 24 2021, 11:35 PM

@courbet thank you for the review!

While we're here, I have a question: if I wanted to make exegesis automatically search for dep-breaking idioms
(an instruction that has at least two aliasing uses), where would I best put it? llvm-exegesis/X86/target.cpp?
I guess it would not run for all the instructions, but only for a predefined set (xor, sub-like, ???)
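As a rough illustration of the "at least two aliasing uses" criterion, here is a sketch that works on the instruction strings shown in the benchmark key. Everything in it is an assumption made for illustration: the function name `has_aliasing_uses` is invented, the first operand is assumed to be the def and the rest uses, and "aliasing" is simplified to an exact register-name match (real aliasing would also have to cover sub- and super-registers).

```python
def has_aliasing_uses(instruction: str) -> bool:
    """True if at least two use operands name the same register.

    Assumes the 'OPCODE DEF USE USE ...' layout seen in the benchmark key.
    """
    _opcode, *operands = instruction.split()
    uses = operands[1:]  # assumption: operand 0 is the def
    return len(uses) != len(set(uses))


print(has_aliasing_uses("VPXORYrr YMM0 YMM0 YMM0"))
print(has_aliasing_uses("VPXORYrr YMM0 YMM1 YMM2"))
```

Such a filter would only pick candidates; whether a candidate really breaks the dependency still has to be measured.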

This revision was landed with ongoing or failed builds. May 25 2021, 2:09 AM
This revision was automatically updated to reflect the committed changes.