This is an archive of the discontinued LLVM Phabricator instance.

[llvm-exegesis] When generating templates with chained instructions, also add templates for helper instructions
Needs ReviewPublic

Authored by lebedev.ri on Apr 8 2019, 2:18 AM.

Details

Reviewers
courbet
gchatelet
Summary

To measure characteristics of instruction Instr, we sometimes need to
chain it together with some other instructions. But that gives us
collective characteristics of those instructions combined.

Initially, in D60000, i have proposed to recover the actual characteristics
of the actual target instruction by using the LLVM scheduling data of
the helper instructions. But it was pointed out that we should not depend
on a-priori data, and use measurements only.

But for that, we need to not only measure the target instruction,
but also measure the instructions that were chained to the target instruction.

I don't believe those measurements should be done by hand.
If i want to measure latency of instruction that can't be executed serially,
at most, i'm willing to run analysis mode on the *automated* measurements,
i don't really want to look what instructions llvm-exegesis has used to
serialize execution. But maybe that is just me?

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision.Apr 8 2019, 2:18 AM

To be noted this won't quite help the -mode=uops, since there it's the same opcode + randomness:

$ ./bin/llvm-exegesis -mode=uops -opcode-name=CMOV16rr -benchmarks-file=-
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-46b66a.o
---
mode:            uops
key:             
  instructions:    
    - 'CMOV16rr AX AX R12W i_0x2'
    - 'CMOV16rr BP BP DX i_0xe'
    - 'CMOV16rr BX BX R10W i_0x3'
    - 'CMOV16rr CX CX BX i_0x1'
    - 'CMOV16rr DI DI R13W i_0x4'
    - 'CMOV16rr DX DX R14W i_0xc'
    - 'CMOV16rr SI SI R10W i_0x9'
    - 'CMOV16rr R8W R8W SI i_0x2'
    - 'CMOV16rr R9W R9W DI i_0x6'
    - 'CMOV16rr R10W R10W BP i_0xc'
    - 'CMOV16rr R11W R11W BX i_0xd'
    - 'CMOV16rr R12W R12W R15W i_0xe'
    - 'CMOV16rr R13W R13W DI i_0xc'
    - 'CMOV16rr R14W R14W R14W i_0xf'
    - 'CMOV16rr R15W R15W R13W i_0x4'
  config:          ''
  register_initial_values: 
    - 'AX=0x0'
    - 'R12W=0x0'
    - 'EFLAGS=0x0'
    - 'BP=0x0'
    - 'DX=0x0'
    - 'BX=0x0'
    - 'R10W=0x0'
    - 'CX=0x0'
    - 'DI=0x0'
    - 'R13W=0x0'
    - 'R14W=0x0'
    - 'SI=0x0'
    - 'R8W=0x0'
    - 'R9W=0x0'
    - 'R11W=0x0'
    - 'R15W=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:    
  - { key: PdFPU0, value: 0, per_snippet_value: 0 }
  - { key: PdFPU1, value: 0, per_snippet_value: 0 }
  - { key: PdFPU2, value: 0, per_snippet_value: 0 }
  - { key: PdFPU3, value: 0, per_snippet_value: 0 }
  - { key: NumMicroOps, value: 1.0135, per_snippet_value: 15.2025 }
error:           ''
info:            instruction has tied variables, using static renaming.
assembled_snippet: 5541574156415541545366B800006641BC00004883EC08C7042400000000C7442404000000009D66BD000066BA000066BB00006641BA000066B9000066BF00006641BD00006641BE000066BE00006641B800006641B900006641BB00006641BF000066410F42C4660F4EEA66410F43DA660F41CB66410F44FD66410F4CD666410F49F266440F42C666440F46CF66440F4CD566440F4DDB66450F4EE766440F4CEF66450F4FF666450F44FD66410F42C45B415C415D415E415F5DC3
...

I understand the motivation behind this change but I think we need a more principled approach. I'll try to sum up my reasoning here.
To me there are two useful modes for llvm-exegesis:

  • We want to look at a particular instruction latency - useful when we want to quickly check an assumption,
  • We want to get the latency for all the instructions - useful to fully characterize or check a processor.

The analysis tool only makes sense for the second bullet.

For Latency analysis, an instruction falls in one of the following benchmarking modes:

  1. Infeasible (privileged instructions, inadequate control flow e.g HLT)
  2. Measurable in isolation
  3. Measurable through another instruction

For 2, we want to explore all the dimensions of the instruction:

  • Impact of choosing several time the same register (XOR EAX, EAX, EAX) or different ones (XOR EAX, EBX, ECX)
  • Impact of choosing special values for immediates (IMUL EAX, EAX, 0 or special values for floating point numbers sNaN qNaN ±∞ ±0 normal and denormal)

For 3, this is 2 combined with a second instruction. The paired instruction is another dimension to explore. Because we're mostly interested in the behavior of the first instruction, we don't need to explore all of this dimension. We can restrict ourselves to the compatible instructions with the less degrees of freedom.

For this exploration to be efficient (manageable) we can't eagerly generate all of the templates. We need a preprocessing step to gather which instructions belong to 2 or 3 - and for the ones in 3 which set of instructions is worth considering, then generate a dependency graph and process the instructions in an order that would allow to deduce the latencies for the instructions in 3, but still exploring the dimensions for 2.

In this automated mode (second bullet) the values for instructions in 3 can be recovered by solving an Ordinary Least Square. The recovered measurements can then be processed by the analysis tool.
I've started working on this, I just need to dedicate more time to it.

tools/llvm-exegesis/lib/SnippetGenerator.cpp
106

Can't you just:

std::move(Templates.begin(), Templates.end(), std::back_inserter(FinalTemplates)); )
unittests/tools/llvm-exegesis/X86/SnippetGeneratorTest.cpp
199

SETCCrExhaustive

213

I don't understand this second line.

224

available

lebedev.ri marked an inline comment as done.Apr 11 2019, 4:50 AM
lebedev.ri added inline comments.
unittests/tools/llvm-exegesis/X86/SnippetGeneratorTest.cpp
213

Line 2 says that while PrimaryInstruction1 is the primary instruction in that snippet,
that PrimaryInstruction1 is the same instruction as SecondaryInstruction0.

...

So i agree in general, but i don't think i agree with fine print.

What if i don't want to validate the whole entirety of the instructions,
but only one of these instructions that is Measurable through another instruction?
I simply can't?

I agree with the general direction of to fully characterize or check a processor.,
but i'm not fully sure yet how that will be used/useful for actual schedule profiles.

I guess it's also a question of "is already useful and getting useful every day"
vs. "useful and will be more useful some time in the future",
i.e. the rate of improvement.

Just 2 cent.

...

So i agree in general, but i don't think i agree with fine print.

What if i don't want to validate the whole entirety of the instructions,
but only one of these instructions that is Measurable through another instruction?
I simply can't?

Well yes that would still be possible but you'd still need to go through the preprocessing / build the dependency graph for the subset of interest. The exploration would be restricted to the instruction + dependent instructions.
There will be a startup cost for this strategy but it will take much less time than testing everything.

If the startup cost is too big there are a few possibilities we can investigate:

  • use a cache to store the precomputed data.
  • have the precomputation done during llvm-exegesis build.

Would that work for you?

...

So i agree in general, but i don't think i agree with fine print.

What if i don't want to validate the whole entirety of the instructions,
but only one of these instructions that is Measurable through another instruction?
I simply can't?

Well yes that would still be possible but you'd still need to go through the preprocessing / build the dependency graph for the subset of interest. The exploration would be restricted to the instruction + dependent instructions.
There will be a startup cost for this strategy but it will take much less time than testing everything.

If the startup cost is too big there are a few possibilities we can investigate:

  • use a cache to store the precomputed data.
  • have the precomputation done during llvm-exegesis build.

Would that work for you?

I think we are talking about *essentially* the same thing, just from
different perspectives so it looks like we are talking about different things.

*Of course* it will still need preprocessing, there will just be no way around that
(well, that is if we do not want to use apriori data).

What i'm saying is, i wouldn't want to run the entire exhaustive benchmark over all the
instructions just so that i can look at a single instruction that needs post-processing.

I'd want to run a *subset* of the benchmarks, that would only explore the target instruction,
plus some extra instructions to 'replace' apriori data.
Which is roughly what this patch does.

So from where i'm sitting, the other exploration is a generalization of *this*.
Maybe i'm missing the point?