This is an archive of the discontinued LLVM Phabricator instance.

[llvm-exegesis] Post-processing for chained instrs in latency mode (PR41275)
Needs ReviewPublic

Authored by lebedev.ri on Mar 29 2019, 10:52 AM.

Details

Summary

Ok, so this turned out to be easier than i expected.
Also, i initially thought that other modes might need this post-processing,
but i'm not sure which opcodes are affected there, if any.

The results look much better.
On BdVer2 this exposes at least one stable sched cluster
that has inconsistent values from the measurements,
and a dozen or so somewhat-unstable clusters that are also inconsistent.

Resolves(?) PR41275

Diff Detail

Event Timeline

lebedev.ri created this revision. Mar 29 2019, 10:52 AM

Regarding remaining noise for these chained instrs, i'm guessing we
also need to account for some other latencies, e.g. domain crossing?

Refactor code a little

lebedev.ri edited the summary of this revision. (Show Details)

While there, also model domain transfer delays.

I'm not sure about the first instruction though,

vpextrb	$1, %xmm2, %edi
vpinsrb	$1, %edi, %xmm7, %xmm2
vpextrb	$1, %xmm2, %edi
vpinsrb	$1, %edi, %xmm7, %xmm2

don't we go fpu->int->fpu?
Shouldn't we also be modelling the fpu2int delays?

Also, still seeing some weird noise in some of these chained instructions.

Also, only post-process 2-instruction benchmarks that were serialized
by llvm-exegesis itself, not just every n-instruction benchmark.

I'll let gchatelet@ review that one as he started looking at doing that some time ago.

It doesn't feel like the right approach to me.
llvm-exegesis is about relying on measurements to deduce information. Here you're using a priori knowledge (SchedClass), which may be wrong.
To me, a better approach would be to read all the experiments, create the dependency graph between the 2-instruction snippets, and solve a system of equations to recover the per-instruction latency, then use the analyzer on the result.

test/tools/llvm-exegesis/X86/analysis-latency-instruction-chaining.test
13 (On Diff #192977)

In this paragraph it is unclear what "next instruction" refers to.

Maybe rephrase to something like: "By construction, the snippet's instructions never overlap in execution. As a consequence, the per-snippet latency is the sum of the latencies of the instructions in the Snippet. For instance, in the following example latency(BT32rr R11D R11D) + latency(RCR8rCL R11B R11B) = 12."

45 (On Diff #192977)

Just to be sure, you crafted this InstructionBenchmark for the sake of testing, right?
I don't see how the current code generator can come up with three instructions.

tools/llvm-exegesis/lib/PostProcessing.h
11

Maybe explain what the post-processing does. As it is, this is not too informative.

lebedev.ri marked 2 inline comments as done. Apr 2 2019, 6:54 AM

It doesn't feel like the right approach to me.

llvm-exegesis is about relying on measurements to deduce information. Here you're using a priori knowledge (SchedClass), which may be wrong.

Yes.

To me, a better approach would be to read all the experiments, create the dependency graph between the 2-instruction snippets, and solve a system of equations to recover the per-instruction latency, then use the analyzer on the result.

Can you explain that in a bit more detail? Something like

lat(i_0) = m_0
sum(lat(i_t)+lat(i_0)) = m_1
lat(i_1) = m_2
sum(lat(i_t)+lat(i_1)) = m_3
...
lat(i_n) => ?
sum(lat(i_t)+lat(i_n)) = m_n
lat(i_t) => ?

Do you suggest taking the known lat(i_0)..lat(i_n) from measurements too?
How will that scheme account for domain transfer delays?

test/tools/llvm-exegesis/X86/analysis-latency-instruction-chaining.test
45 (On Diff #192977)

Yes, this is just a copy-paste.
As you can see, assembled_snippet is identical to the previous benchmark.

To me, a better approach would be to read all the experiments, create the dependency graph between the 2-instruction snippets, and solve a system of equations to recover the per-instruction latency, then use the analyzer on the result.

Can you explain that in a bit more detail? Something like

lat(i_0) = m_0
sum(lat(i_t)+lat(i_0)) = m_1
lat(i_1) = m_2
sum(lat(i_t)+lat(i_1)) = m_3
...
lat(i_n) => ?
sum(lat(i_t)+lat(i_n)) = m_n
lat(i_t) => ?

Do you suggest taking the known lat(i_0)..lat(i_n) from measurements too?

Yes.
Measurements ought to be coherent for runs on the same CPU, so with enough data the resulting linear system will be over-constrained and solvable using ordinary least squares (https://en.wikipedia.org/wiki/Ordinary_least_squares).
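
Concretely, and just as a sketch of the formulation (reusing the lat(i_k)/m_k notation from above), each benchmark contributes one linear equation in the unknown per-instruction latencies:

\begin{align*}
\mathrm{lat}(i_k) &= m_{2k} && \text{(single-instruction benchmark)}\\
\mathrm{lat}(i_t) + \mathrm{lat}(i_k) &= m_{2k+1} && \text{(chained 2-instruction benchmark)}
\end{align*}

i.e. a system $A\,x = m$ with one row per measurement and $x = (\mathrm{lat}(i_0), \ldots, \mathrm{lat}(i_n), \mathrm{lat}(i_t))^{\top}$; with more rows than unknowns, the ordinary-least-squares estimate is $\hat{x} = \arg\min_x \lVert A\,x - m \rVert_2^2 = (A^{\top} A)^{-1} A^{\top} m$.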

How will that scheme account for domain transfer delays?

You could associate a supplementary variable with each pair of instructions, but this would need a lot of data to converge (way too many variables).
A simpler approach is to make sure that we don't generate domain transfer delays when generating the snippet, rejecting pairs of instructions for which one would occur.
Or annotate the results with information about domain transfer delays and deal with it in the post-processing (adding variables for pairs in {int,vector,fp,store}² when we know such a transfer exists); this way we can recover the domain transfer delays as well.
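
In the variant with extra variables, each chained measurement would then read roughly (again just a sketch):

$\mathrm{lat}(i_a) + \mathrm{lat}(i_b) + d_{\mathrm{dom}(i_a) \to \mathrm{dom}(i_b)} = m_{ab}$

with one unknown delay $d$ per ordered pair of domains in {int,vector,fp,store}², shared by all instruction pairs that cross that particular boundary.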

lebedev.ri marked an inline comment as done. Apr 3 2019, 3:03 PM

To me, a better approach would be to read all the experiments, create the dependency graph between the 2-instruction snippets, and solve a system of equations to recover the per-instruction latency, then use the analyzer on the result.

Can you explain that in a bit more detail? Something like

lat(i_0) = m_0
sum(lat(i_t)+lat(i_0)) = m_1
lat(i_1) = m_2
sum(lat(i_t)+lat(i_1)) = m_3
...
lat(i_n) => ?
sum(lat(i_t)+lat(i_n)) = m_n
lat(i_t) => ?

Do you suggest taking the known lat(i_0)..lat(i_n) from measurements too?

Yes.
Measurements ought to be coherent for runs on the same CPU, so with enough data the resulting linear system will be over-constrained and solvable using ordinary least squares (https://en.wikipedia.org/wiki/Ordinary_least_squares).

Okay, sounds sane.
Intermediate issue to solve: creating all these 2-instr chained configs must then also
create configs to measure the params of that second instr (without going into an endless loop).

How will that scheme account for domain transfer delays?

You could associate a supplementary variable with each pair of instructions, but this would need a lot of data to converge (way too many variables).
A simpler approach is to make sure that we don't generate domain transfer delays when generating the snippet, rejecting pairs of instructions for which one would occur.
Or annotate the results with information about domain transfer delays and deal with it in the post-processing (adding variables for pairs in {int,vector,fp,store}² when we know such a transfer exists); this way we can recover the domain transfer delays as well.

Hmm, i don't mean to mock, but it sounds kinda hand-wavy/arbitrary.
We don't want to use the latencies of instructions specified in the scheduler profile, but at the same
time we are ok with expecting that the sched profile explains all the domain transfer delays.
They should probably also be variables, not hardcoded. But i admit i have not thought that part through.

In the same train of thought, it would be great if it could try to magically deduce the actual Units (*not* just the pressure distribution).

Intermediate issue to solve: creating all these 2-instr chained configs must then also
create configs to measure the params of that second instr (without going into an endless loop).

Yes it's more work and potentially really long runtimes. The randomization part is here because the original design was to run llvm-exegesis on many machines and aggregate the runs to do the analysis. The more data the better so we can test our hypotheses.

How will that scheme account for domain transfer delays?

You could associate a supplementary variable with each pair of instructions, but this would need a lot of data to converge (way too many variables).
A simpler approach is to make sure that we don't generate domain transfer delays when generating the snippet, rejecting pairs of instructions for which one would occur.
Or annotate the results with information about domain transfer delays and deal with it in the post-processing (adding variables for pairs in {int,vector,fp,store}² when we know such a transfer exists); this way we can recover the domain transfer delays as well.

Hmm, i don't mean to mock

Then don't. The mere fact you're mentioning it is already suspicious :)

but it sounds kinda hand-wavy/arbitrary.

I offered three paths to explore; I don't know yet whether they work, nor which one is best.
I'd need to dedicate some time to this, but I don't have much right now TBH.

We don't want to use the latencies of instructions specified in the scheduler profile, but at the same
time we are ok with expecting that the sched profile explains all the domain transfer delays.

We want the tool to be as generic as possible to target other platforms, but some decisions have to be target-specific (e.g. the stack-based registers of x87).
This is why llvm-exegesis relies on ExegesisTarget in llvm-exegesis/lib/Target.h.
As such, it makes sense that the measurement tool is customized based on what we know or suspect about the target.
On the contrary, it is desirable that the analysis tool not depend on target knowledge as much.

They should probably also be variables, not hardcoded. But i admit i have not thought that part through.

This was one of the suggestions I made (DTD being variables); I agree this needs to be thought through.

In the same train of thought, it would be great if it could try to magically deduce the actual Units (*not* just the pressure distribution).

That would be nice indeed, although nothing obvious comes to mind. We only have hardware performance counters and code Snippets to recover knowledge from.

Intermediate issue to solve: creating all these 2-instr chained configs must then also
create configs to measure the params of that second instr (without going into an endless loop).

Yes it's more work and potentially really long runtimes. The randomization part is here because the original design was to run llvm-exegesis on many machines and aggregate the runs to do the analysis. The more data the better so we can test our hypotheses.

How will that scheme account for domain transfer delays?

You could associate a supplementary variable with each pair of instructions, but this would need a lot of data to converge (way too many variables).
A simpler approach is to make sure that we don't generate domain transfer delays when generating the snippet, rejecting pairs of instructions for which one would occur.
Or annotate the results with information about domain transfer delays and deal with it in the post-processing (adding variables for pairs in {int,vector,fp,store}² when we know such a transfer exists); this way we can recover the domain transfer delays as well.

Hmm, i don't mean to mock

Then don't. The mere fact you're mentioning it is already suspicious :)

:)

but it sounds kinda hand-wavy/arbitrary.

I offered three paths to explore; I don't know yet whether they work, nor which one is best.
I'd need to dedicate some time to this, but I don't have much right now TBH.

We don't want to use the latencies of instructions specified in the scheduler profile, but at the same
time we are ok with expecting that the sched profile explains all the domain transfer delays.

We want the tool to be as generic as possible to target other platforms, but some decisions have to be target-specific (e.g. the stack-based registers of x87).
This is why llvm-exegesis relies on ExegesisTarget in llvm-exegesis/lib/Target.h.
As such, it makes sense that the measurement tool is customized based on what we know or suspect about the target.
On the contrary, it is desirable that the analysis tool not depend on target knowledge as much.

What i'm saying is that it is pretty much known as a fact that LLVM (at least on X86?) has (very?) partial modelling
of these extra delays, so if we take it for granted that all of them are already modelled, the post-processing
may produce rather misleading results.

They should probably also be variables, not hardcoded. But i admit i have not thought that part through.

This was one of the suggestions I made (DTD being variables); I agree this needs to be thought through.

ack

In the same train of thought, it would be great if it could try to magically deduce the actual Units (*not* just the pressure distribution).

That would be nice indeed, although nothing obvious comes to mind. We only have hardware performance counters and code Snippets to recover knowledge from.

ack

Intermediate issue to solve: creating all these 2-instr chained configs must then also
create configs to measure the params of that second instr (without going into an endless loop).

Yes it's more work and potentially really long runtimes. The randomization part is here because the original design was to run llvm-exegesis on many machines and aggregate the runs to do the analysis. The more data the better so we can test our hypotheses.

How will that scheme account for domain transfer delays?

You could associate a supplementary variable with each pair of instructions, but this would need a lot of data to converge (way too many variables).
A simpler approach is to make sure that we don't generate domain transfer delays when generating the snippet, rejecting pairs of instructions for which one would occur.
Or annotate the results with information about domain transfer delays and deal with it in the post-processing (adding variables for pairs in {int,vector,fp,store}² when we know such a transfer exists); this way we can recover the domain transfer delays as well.

Hmm, i don't mean to mock

Then don't. The mere fact you're mentioning it is already suspicious :)

but it sounds kinda hand-wavy/arbitrary.

I offered three paths to explore; I don't know yet whether they work, nor which one is best.
I'd need to dedicate some time to this, but I don't have much right now TBH.

Any kind of approximate time estimate on this?
Next month? This release cycle? Next year? "some day, once there is time"?

@courbet @gchatelet bump. any plans on working on that functionality anytime soon? :)
Not sure how healthy it is for any project to have this kind of silence, neither working on the missing functionality
nor allowing other interested parties to work on it...

@courbet @gchatelet bump. any plans on working on that functionality anytime soon? :)
Not sure how healthy it is for any project to have this kind of silence, neither working on the missing functionality
nor allowing other interested parties to work on it...

Yes I'll be working on it from now on. Stay tuned.

@courbet @gchatelet bump. any plans on working on that functionality anytime soon? :)
Not sure how healthy it is for any project to have this kind of silence, neither working on the missing functionality
nor allowing other interested parties to work on it...

Yes I'll be working on it from now on. Stay tuned.

I've been sidetracked, but this is still on my radar.

lebedev.ri changed the repository for this revision from rL LLVM to rG LLVM Github Monorepo.
lebedev.ri added a subscriber: fhahn.

Now that i'm newly motivated to finally get this fixed, let's try again :)

This implements a linear algebra support library (i don't think we want to pull in Eigen, do we?),
implements ordinary/weighted least squares, and uses them to post-process latency benchmark points.

This kinda just works.

But i've made two observations:

  1. if we measured a single instruction, we can say that the measurement is precise, so its estimator should deviate from the actual measurement by as little as possible => weighted least squares (see the sketch after this list)
  2. i was really hoping i wouldn't have to deal with domain transfer delays, but the results are kinda bogus right now :) That, or there is some other issue.
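
To make observation 1 concrete, the weighted-least-squares step boils down to roughly this (a simplified sketch with made-up names and a plain dense-matrix representation, not the actual code in LinearAlgebra.h):

// Simplified sketch, not the patch code: solve the weighted least-squares
// problem min ||A x - b||_W^2 by forming the normal equations
// (A^T W A) x = A^T W b and solving the small square system directly.
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>; // row-major dense matrix

// Solves the square system M x = Y via Gaussian elimination with partial
// pivoting. Assumes M is non-singular.
static std::vector<double> solveSquare(Matrix M, std::vector<double> Y) {
  const size_t N = M.size();
  for (size_t Col = 0; Col < N; ++Col) {
    // Pick the row with the largest absolute value in this column as pivot.
    size_t Pivot = Col;
    for (size_t Row = Col + 1; Row < N; ++Row)
      if (std::abs(M[Row][Col]) > std::abs(M[Pivot][Col]))
        Pivot = Row;
    std::swap(M[Col], M[Pivot]);
    std::swap(Y[Col], Y[Pivot]);
    // Eliminate the entries below the pivot.
    for (size_t Row = Col + 1; Row < N; ++Row) {
      const double F = M[Row][Col] / M[Col][Col];
      for (size_t K = Col; K < N; ++K)
        M[Row][K] -= F * M[Col][K];
      Y[Row] -= F * Y[Col];
    }
  }
  // Back-substitution.
  std::vector<double> X(N);
  for (size_t Row = N; Row-- > 0;) {
    double Sum = Y[Row];
    for (size_t K = Row + 1; K < N; ++K)
      Sum -= M[Row][K] * X[K];
    X[Row] = Sum / M[Row][Row];
  }
  return X;
}

// Weighted least squares: A is NumMeasurements x NumUnknowns, B holds the
// measured latencies, W holds one positive weight per measurement, e.g. a
// large weight for single-instruction benchmarks whose measurement we trust.
std::vector<double> weightedLeastSquares(const Matrix &A,
                                         const std::vector<double> &B,
                                         const std::vector<double> &W) {
  const size_t Rows = A.size(), Cols = A[0].size();
  assert(B.size() == Rows && W.size() == Rows);
  Matrix AtWA(Cols, std::vector<double>(Cols, 0.0));
  std::vector<double> AtWB(Cols, 0.0);
  for (size_t R = 0; R < Rows; ++R)
    for (size_t I = 0; I < Cols; ++I) {
      for (size_t J = 0; J < Cols; ++J)
        AtWA[I][J] += A[R][I] * W[R] * A[R][J];
      AtWB[I] += A[R][I] * W[R] * B[R];
    }
  return solveSquare(std::move(AtWA), std::move(AtWB));
}

With all weights equal this degenerates to ordinary least squares.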

Actually drop pre-monorepo diff parts.

Some early comments on the linear algebra library.
Let me sync with @courbet first.

llvm/include/llvm/Support/LinearAlgebra.h
31

static_assert to make sure that std::is_base_of<T, Derived>::value==true
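
I.e. something along these lines (hypothetical names, just to illustrate the suggestion; not the code in the diff):

// Illustration only: assert the CRTP relationship so that a wrong Derived
// parameter fails with a readable error instead of an invalid cast.
#include <type_traits>

template <typename Derived> class MatrixBase {
public:
  Derived &derived() {
    static_assert(std::is_base_of<MatrixBase, Derived>::value,
                  "Derived must inherit from MatrixBase<Derived>");
    return static_cast<Derived &>(*this);
  }
};

class MyMatrix : public MatrixBase<MyMatrix> { /* ... */ };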

34

fix

38

you may want to provide a function to factor the cast operation in.

38–39

These could be properties: rows(), columns().

61–65

fix

67–68

ditto - properties

83

MatrixTy

84

AFAIU from the implementation, this really is a TransposedMatrixView. I think it's important to highlight because m should outlive the TransposedMatrix object.
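
Roughly this shape, i.e. (a sketch with made-up member names, not the code in the diff; it assumes the wrapped matrix type exposes operator()(Row, Col), rows() and columns()):

// Illustration of the naming/lifetime point: a *view* only references the
// underlying matrix, so that matrix must outlive the view object.
template <typename MatrixTy> class TransposedMatrixView {
  const MatrixTy &M; // non-owning reference: M must outlive this view

public:
  explicit TransposedMatrixView(const MatrixTy &M) : M(M) {}
  double operator()(int Row, int Col) const { return M(Col, Row); } // swapped
  int rows() const { return M.columns(); }
  int columns() const { return M.rows(); }
};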

86

matrix

86

Using a reference instead of a pointer prevents rebinding of this object (can't move, can't reassign). Is it a design decision?

98

This is still a View AFAICT

141–142

Wrong comment

150

const

198

This is hard to read; do you mind introducing a variable for the expected value?

210

const

221

const here and below

353

const here and below

Some early comments on the linear algebra library.
Let me sync with @courbet first.

I'm still in the early experimentation stage with this; this is not the code i'd normally post, i only posted it to avoid concurrent implementations.
I'm still getting really weird latencies, and just subtracting the forwarding delays from the sched model doesn't help.
This may mean that either the math is wrong, or there are many more forwarding delays that we don't model.

Looks like we end up recovering a forwarding delay of -0.04.
It should be ~-2.

Hm, one thing i should/could try here is to ignore benchmarks that may have forwarding delays.
It won't help with unmodelled forwarding delays (we don't model IVec->FVec delays, do we?),
but maybe it will be good enough for some sequences...

Matt added a subscriber: Matt. Jul 27 2021, 5:23 AM