This is an archive of the discontinued LLVM Phabricator instance.

Paths

llvm/include/llvm/Support/LinearAlgebra.h
llvm/lib/MC/MCSchedule.cpp
llvm/lib/Target/X86/X86ScheduleZnver3.td
llvm/test/tools/llvm-exegesis/X86/analysis-latency-instruction-chaining-domain-transfer.test
llvm/test/tools/llvm-exegesis/X86/analysis-latency-instruction-chaining.test
llvm/tools/llvm-exegesis/lib/Analysis.h
llvm/tools/llvm-exegesis/lib/Analysis.cpp
llvm/tools/llvm-exegesis/lib/CMakeLists.txt
llvm/tools/llvm-exegesis/lib/PostProcessing.h
llvm/tools/llvm-exegesis/lib/PostProcessing.cpp
llvm/tools/llvm-exegesis/llvm-exegesis.cpp
llvm/unittests/Support/LinearAlgebraTest.cpp

Differential D60000

[llvm-exegesis] Post-processing for chained instrs in latency mode (PR41275)
Needs Review · Public

Authored by lebedev.ri on Mar 29 2019, 10:52 AM.

Details

Reviewers

courbet
gchatelet
andreadb

Summary

Ok, so this turned out to be easier than i expected.
Also, i initially thought that other modes might need this post-processing,
but i'm not sure which opcodes are affected there, if any.

The results look much better.
On BdVer2 this exposes at least one stable sched cluster
that has values inconsistent with the measurements,
and a dozen or so somewhat-unstable clusters that are also inconsistent.

latency-clusters-stable.html (165 KB)

latency-clusters-unstable.html (1 MB)

Resolves(?) PR41275

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

lebedev.ri created this revision. · Mar 29 2019, 10:52 AM

Herald added subscribers: jdoerfert, tschuett, mgorny. · Mar 29 2019, 10:52 AM

Regarding remaining noise for these chained instrs, i'm guessing we
also need to account for some other latencies, e.g. domain crossing?

Refactor code a little

While there, also model domain transfer delays.

I'm not sure about the first instruction though,

vpextrb	$1, %xmm2, %edi
vpinsrb	$1, %edi, %xmm7, %xmm2
vpextrb	$1, %xmm2, %edi
vpinsrb	$1, %edi, %xmm7, %xmm2

don't we go fpu->int->fpu ?
Shouldn't we be also modelling the fpu2int delays?

Also, still seeing some weird noise in some of these chained instructions.

lebedev.ri added a reviewer: andreadb. · Mar 30 2019, 7:19 AM

Also, only post-process 2-instruction benchmarks that were serialized
by llvm-exegesis itself, not every n-instruction benchmark.

I'll let gchatelet@ review that one as he started looking at doing that some time ago.

It doesn't feel like the right approach to me.
llvm-exegesis is about relying on measurement to deduce information. Here you're using a priori knowledge (SchedClass) which may be wrong.
To me, a better approach would be to read all the experiments, create the dependency graph between the 2-instructions snippets and solve a system of equations to recover the per instruction latency, then use the analyzer on the result.

test/tools/llvm-exegesis/X86/analysis-latency-instruction-chaining.test
13 ↗	(On Diff #192977)	In this paragraph it is unclear what `next instruction` refers to. Maybe rephrase to something like "By construction, the snippet's instructions never overlap in execution. As a consequence, the `per-snippet latency` is the sum of the latencies of the instructions in the snippet. For instance, in the following example, latency(BT32rr R11D R11D) + latency(RCR8rCL R11B R11B) = 12."
45 ↗	(On Diff #192977)	Just to be sure, you crafted this InstructionBenchmark for the sake of testing right? I don't see how the current code generator can come up with three instructions.
tools/llvm-exegesis/lib/PostProcessing.h
10 ↗	(On Diff #192977)	Maybe explain what post processing does. As such it is not too informative.

In D60000#1451313, @gchatelet wrote:

> It doesn't feel like the right approach to me.

In D60000#1451313, @gchatelet wrote:

> llvm-exegesis is about relying on measurement to deduce information. Here you're using a priori knowledge (SchedClass) which may be wrong.

Yes.

> To me, a better approach would be to read all the experiments, create the dependency graph between the 2-instructions snippets and solve a system of equations to recover the per instruction latency, then use the analyzer on the result.

Can you explain that in a bit more detail? Something like

lat(i_0) = m_0
sum(lat(i_t)+lat(i_0)) = m_1
lat(i_1) = m_2
sum(lat(i_t)+lat(i_1)) = m_3
...
lat(i_n) => ?
sum(lat(i_t)+lat(i_n)) = m_n
lat(i_t) => ?

Do you suggest to take known lat(i_0)..lat(i_n) from measurements too?
How will that scheme account for domain transfer delays?

test/tools/llvm-exegesis/X86/analysis-latency-instruction-chaining.test
45 ↗	(On Diff #192977)	Yes, this is just a copy-paste. As you can see, `assembled_snippet` is identical to the previous benchmark.

>> To me, a better approach would be to read all the experiments, create the dependency graph between the 2-instructions snippets and solve a system of equations to recover the per instruction latency, then use the analyzer on the result.

> Can you explain that in a bit more detail? Something like
> lat(i_0) = m_0
> sum(lat(i_t)+lat(i_0)) = m_1
> lat(i_1) = m_2
> sum(lat(i_t)+lat(i_1)) = m_3
> ...
> lat(i_n) => ?
> sum(lat(i_t)+lat(i_n)) = m_n
> lat(i_t) => ?
> Do you suggest to take known lat(i_0)..lat(i_n) from measurements too?

Yes.
Measurements ought to be coherent for runs on the same CPU, so with enough data the resulting linear system will be overconstrained and is solvable using ordinary least squares (https://en.wikipedia.org/wiki/Ordinary_least_squares).
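
For readers of this thread, here is a minimal sketch of that idea, not the code from this revision. It assumes each benchmark becomes one row of a matrix A (a 1 in the column of every instruction in the snippet), the measured latency goes into b, and the normal equations (A^T A) x = A^T b are solved by Gaussian elimination. The names `Row`, `solveSquare` and `solveLeastSquares` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Row = std::vector<double>;

// Solve the square system M * x = y by Gaussian elimination with partial
// pivoting. M must be non-singular.
std::vector<double> solveSquare(std::vector<Row> M, std::vector<double> y) {
  const std::size_t n = y.size();
  for (std::size_t col = 0; col != n; ++col) {
    // Use the row with the largest magnitude in this column as the pivot.
    std::size_t pivot = col;
    for (std::size_t r = col + 1; r != n; ++r)
      if (std::fabs(M[r][col]) > std::fabs(M[pivot][col]))
        pivot = r;
    assert(M[pivot][col] != 0 && "singular system");
    std::swap(M[pivot], M[col]);
    std::swap(y[pivot], y[col]);
    // Eliminate this column from all rows below the pivot row.
    for (std::size_t r = col + 1; r != n; ++r) {
      const double f = M[r][col] / M[col][col];
      for (std::size_t c = col; c != n; ++c)
        M[r][c] -= f * M[col][c];
      y[r] -= f * y[col];
    }
  }
  // Back substitution on the upper triangular system.
  std::vector<double> x(n);
  for (std::size_t i = n; i-- > 0;) {
    double s = y[i];
    for (std::size_t c = i + 1; c != n; ++c)
      s -= M[i][c] * x[c];
    x[i] = s / M[i][i];
  }
  return x;
}

// Ordinary least squares via the normal equations (A^T A) x = A^T b.
// A has one row per benchmark and one column per unknown latency.
std::vector<double> solveLeastSquares(const std::vector<Row> &A,
                                      const std::vector<double> &b) {
  assert(!A.empty() && A.size() == b.size());
  const std::size_t n = A.front().size();
  std::vector<Row> AtA(n, Row(n, 0.0));
  std::vector<double> Atb(n, 0.0);
  for (std::size_t r = 0; r != A.size(); ++r) {
    for (std::size_t i = 0; i != n; ++i) {
      Atb[i] += A[r][i] * b[r];
      for (std::size_t j = 0; j != n; ++j)
        AtA[i][j] += A[r][i] * A[r][j];
    }
  }
  return solveSquare(std::move(AtA), std::move(Atb));
}
```

With columns (i_0, i_1, i_t), the rows {1,0,0}, {0,1,0}, {1,0,1}, {0,1,1} and measurements 1, 2, 4, 5 describe two standalone measurements and two chained ones; the solver then recovers the latency of the helper instruction i_t from the chained rows.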

> How will that scheme account for domain transfer delays?

You could associate a supplementary variable for pairs of instructions but this would need a lot of data to converge (way too many variables).
A simpler approach is to make sure that we don't generate domain transfer delays when generating the snippet, rejecting pairs of instructions for which it would occur.
Or annotate the results with information about domain transfer delays and deal with it in the post processing (adding variables for pairs in {int,vector,fp,store}² when we know such a transfer exists); this way we can recover the domain transfer delays as well.
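
A rough sketch of the third option, under loud assumptions: every instruction is tagged with a single execution domain that is known up front, and each ordered domain pair gets its own unknown column after the per-instruction latency columns. `Domain`, `Instr`, `transferColumn` and `makeRow` are hypothetical names, not llvm-exegesis API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Domain { Int = 0, Vector, FP, Store };
constexpr std::size_t NumDomains = 4;

struct Instr {
  std::size_t Index; // column of this instruction's latency variable
  Domain Dom;        // execution domain (assumed to be known up front)
};

// Column of the transfer-delay variable for the ordered pair (From, To),
// laid out after the NumInstrs per-instruction latency columns.
std::size_t transferColumn(std::size_t NumInstrs, Domain From, Domain To) {
  return NumInstrs + static_cast<std::size_t>(From) * NumDomains +
         static_cast<std::size_t>(To);
}

// Build one design-matrix row for a serial chain of instructions whose last
// instruction feeds the first one again (the snippet is executed in a loop).
std::vector<double> makeRow(std::size_t NumInstrs,
                            const std::vector<Instr> &Chain) {
  std::vector<double> Row(NumInstrs + NumDomains * NumDomains, 0.0);
  for (std::size_t I = 0; I != Chain.size(); ++I) {
    const Instr &Cur = Chain[I];
    const Instr &Next = Chain[(I + 1) % Chain.size()];
    Row[Cur.Index] += 1.0; // the instruction's own latency
    if (Cur.Dom != Next.Dom) // one transfer delay per domain crossing
      Row[transferColumn(NumInstrs, Cur.Dom, Next.Dom)] += 1.0;
  }
  return Row;
}
```

For a two-instruction chain that alternates between a vector-domain and an integer-domain instruction, the row gets a 1 for each instruction plus a 1 for each of the two crossings, so the least-squares fit can attribute part of the measured latency to the transfers rather than to the instructions.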

In D60000#1452962, @gchatelet wrote:

> Yes.
> Measurements ought to be coherent for runs on the same CPU, so with enough data the resulting linear system will be overconstrained and is solvable using ordinary least squares (https://en.wikipedia.org/wiki/Ordinary_least_squares).

Okay, sounds sane.
Intermediate issue to solve: creating all these 2-instr chained configs must then also
create configs to measure the params of that second instr (without going into endless loop).

>> How will that scheme account for domain transfer delays?

> You could associate a supplementary variable for pairs of instructions but this would need a lot of data to converge (way too many variables).
> A simpler approach is to make sure that we don't generate domain transfer delays when generating the snippet, rejecting pairs of instructions for which it would occur.
> Or annotate the results with information about domain transfer delays and deal with it in the post processing (adding variables for pairs in {int,vector,fp,store}² when we know such a transfer exists); this way we can recover the domain transfer delays as well.

Hmm, i don't mean to mock, but it sounds kinda hand-wavy/arbitrary.
We don't want to use latencies of instructions specified in scheduler profile, but at the same
time we are ok with expecting that the sched profile explains all the domain transfer delays.
They should probably also be variables, not hardcoded. But i admit i have not thought that part through.

In the same thoughtflow, would be great if it could try to magically deduce the actual Units (*not* just pressure distribution)

> Intermediate issue to solve: creating all these 2-instr chained configs must then also
> create configs to measure the params of that second instr (without going into endless loop).

Yes it's more work and potentially really long runtimes. The randomization part is here because the original design was to run llvm-exegesis on many machines and aggregate the runs to do the analysis. The more data the better so we can test our hypotheses.

> Hmm, i don't mean to mock

Then don't. The mere fact you're mentioning it is already suspicious :)

> but it sounds kinda hand-wavy/arbitrary.

I offered three paths to explore, I don't know yet if they work, nor which one is best.
I'd need to dedicate some time to this but I don't have much right now TBH.

> We don't want to use latencies of instructions specified in scheduler profile, but at the same
> time we are ok with expecting that the sched profile explains all the domain transfer delays.

We want the tool to be as generic as possible to target other platforms, but some decisions have to be target specific (e.g. the stack based registers of x87).
This is why llvm-exegesis relies on ExegesisTarget in llvm-exegesis/lib/Target.h.
As such it makes sense that the measurement tool is customized based on what we know or suspect about the target.
On the contrary, it is a desirable property that the analysis tool not depend on target knowledge as much.

> They should probably also be variables, not hardcoded. But i admit i have not thought that part through.

This was one of the suggestions I made (DTD, i.e. domain transfer delays, being variables), I agree this needs to be thought through.

> In the same thoughtflow, would be great if it could try to magically deduce the actual Units (*not* just pressure distribution)

That would be nice indeed, although nothing obvious comes to mind. We only have hardware performance counters and code snippets to recover knowledge from.

In D60000#1454661, @gchatelet wrote:

>> We don't want to use latencies of instructions specified in scheduler profile, but at the same
>> time we are ok with expecting that the sched profile explains all the domain transfer delays.

> We want the tool to be as generic as possible to target other platforms, but some decisions have to be target specific (e.g. the stack based registers of x87).
> This is why llvm-exegesis relies on ExegesisTarget in llvm-exegesis/lib/Target.h.
> As such it makes sense that the measurement tool is customized based on what we know or suspect about the target.
> On the contrary, it is a desirable property that the analysis tool not depend on target knowledge as much.

What i'm saying is that it is pretty much known as a fact that LLVM (at least on X86?) has (very?) partial modelling
of these extra delays, so if we take as granted that all of them are already modelled, the post-processing
may produce rather misleading results.

>> They should probably also be variables, not hardcoded. But i admit i have not thought that part through.

> This was one of the suggestions I made (DTD, i.e. domain transfer delays, being variables), I agree this needs to be thought through.

ack

>> In the same thoughtflow, would be great if it could try to magically deduce the actual Units (*not* just pressure distribution)

> That would be nice indeed, although nothing obvious comes to mind. We only have hardware performance counters and code snippets to recover knowledge from.

ack

lebedev.ri mentioned this in D60401: [llvm-exegesis] When generating templates with chained instructions, also add templates for helper instructions. · Apr 8 2019, 2:18 AM

In D60000#1454661, @gchatelet wrote:

> I offered three paths to explore, I don't know yet if they work, nor which one is best.
> I'd need to dedicate some time to this but I don't have much right now TBH.

Any kind of approximate time estimate on this?
Next month? This release cycle? Next year? "some day, once there is time"?

@courbet @gchatelet bump. any plans on working on that functionality anytime soon? :)
Not sure how healthy it is for any project to have such silence, in the form of neither working on
nor allowing other interested parties to work on missing functionality...

In D60000#1785210, @lebedev.ri wrote:

> @courbet @gchatelet bump. any plans on working on that functionality anytime soon? :)
> Not sure how healthy it is for any project to have such silence, in the form of neither working on
> nor allowing other interested parties to work on missing functionality...

Yes I'll be working on it from now on. Stay tuned.

mstojanovic added a subscriber: mstojanovic. · Dec 18 2019, 10:04 AM

In D60000#1785687, @gchatelet wrote:

> In D60000#1785210, @lebedev.ri wrote:
>
>> @courbet @gchatelet bump. any plans on working on that functionality anytime soon? :)
>> Not sure how healthy it is for any project to have such silence, in the form of neither working on
>> nor allowing other interested parties to work on missing functionality...
>
> Yes I'll be working on it from now on. Stay tuned.

I've been side tracked but this is still on my radar.

lebedev.ri mentioned this in D74156: [llvm-exegesis] Exploring X86::OperandType::OPERAND_COND_CODE. · Feb 7 2020, 9:50 AM

lebedev.ri mentioned this in D75510: [X86][llvm-exegesis] Exploring vector insert/extract. · Mar 3 2020, 2:26 AM

Now that i'm newly motivated to finally get this fixed, let's try again :)

This implements a linear algebra support library (i don't think we want to pull in Eigen, do we?),
implements ordinary/weighted least squares, and uses it to post-process latency benchmark points.

This kinda just works:

latency-clusters-ols.html (198 KB)

latency-clusters.html (151 KB)

But i've made two observations:

- if we measured a single instruction, we can say that the measurement is precise, so its estimator should deviate from the actual measurement by as little as possible => weighted least squares
- i was really hoping i wouldn't have to deal with domain transfer delays, but the results are kinda bogus right now :) That, or there is some other issue.
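
To illustrate the first observation, here is a small standalone sketch of weighted least squares for just two unknowns. It is not the implementation in this revision: it assumes each sample carries a weight W and minimises the weighted squared residuals by solving the 2x2 normal equations (A^T W A) x = A^T W b with Cramer's rule. `Sample` and `solveWLS2` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Sample {
  double A0, A1; // coefficients of the two unknown latencies
  double B;      // measured latency of the snippet
  double W;      // weight; larger means "trust this measurement more"
};

// Weighted least squares for two unknowns: minimise
//   sum_r W_r * (A0_r * x0 + A1_r * x1 - B_r)^2
// by solving the 2x2 normal equations with Cramer's rule.
std::pair<double, double> solveWLS2(const std::vector<Sample> &Samples) {
  double S00 = 0, S01 = 0, S11 = 0, T0 = 0, T1 = 0;
  for (const Sample &S : Samples) {
    S00 += S.W * S.A0 * S.A0;
    S01 += S.W * S.A0 * S.A1;
    S11 += S.W * S.A1 * S.A1;
    T0 += S.W * S.A0 * S.B;
    T1 += S.W * S.A1 * S.B;
  }
  const double Det = S00 * S11 - S01 * S01;
  assert(Det != 0 && "unknowns are not identifiable from these samples");
  return {(T0 * S11 - S01 * T1) / Det, (S00 * T1 - S01 * T0) / Det};
}
```

Giving the single-instruction sample a large weight pins its estimate close to the measured value, while the remaining disagreement between samples is absorbed by the other unknown; with equal weights the error is spread over both.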

Herald added a subscriber: dexonsmith. · May 4 2021, 12:28 PM

Actually drop pre-monorepo diff parts.

Some early comments on the linear algebra library.
Let me sync with @courbet first.

llvm/include/llvm/Support/LinearAlgebra.h
30	`static_assert` to make sure that `std::is_base_of<T, Derived>::value==true`
33	fix
37	you may want to provide a function to factor the cast operation in.
37–38	These could be properties `rows()` `columns()`.
60–64	fix
66–67	ditto - properties
82	`MatrixTy`
83	AFAIU from the implementation, this really is a `TransposedMatrixView`. I think it's important to highlight because `m` should outlive the `TransposedMatrix` object.
85	`matrix`
85	Using a reference instead of a pointer prevents rebinding of this object (can't move, can't reassign). Is it a design decision?
97	This is still a `View` AFAICT
140–141	Wrong comment
149	`const`
197	This is hard to read, do you mind introducing a variable for the expected value?
209	`const`
220	`const` here and below
352	`const` here and below
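
For illustration only, this is one way several of the suggestions above could look together: a single `derived()` helper that holds the cast and a `static_assert`, `rows()`/`columns()` accessors, `const` overloads, and a transposed view that stores a pointer so it can be rebound. It is a sketch, not the revised header. Checking `std::is_base_of` against `MatrixInterface<T, Derived>` is an assumption about what the first comment intends, and `TransposedMatrixView` and the member names are hypothetical.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

template <typename T, typename Derived> struct MatrixInterface {
  int rows() const { return derived().rows(); }
  int columns() const { return derived().columns(); }
  T &operator()(int Row, int Col) { return derived()(Row, Col); }
  const T &operator()(int Row, int Col) const { return derived()(Row, Col); }

private:
  // The cast lives in one place. The static_assert sits inside a member
  // function because Derived is still incomplete in the class body.
  Derived &derived() {
    static_assert(std::is_base_of<MatrixInterface, Derived>::value,
                  "Derived must inherit from MatrixInterface<T, Derived>");
    return static_cast<Derived &>(*this);
  }
  const Derived &derived() const {
    static_assert(std::is_base_of<MatrixInterface, Derived>::value,
                  "Derived must inherit from MatrixInterface<T, Derived>");
    return static_cast<const Derived &>(*this);
  }
};

template <typename T = double>
class Matrix : public MatrixInterface<T, Matrix<T>> {
  std::vector<T> Storage;
  int NumRows, NumCols;

public:
  Matrix(int Rows, int Cols)
      : Storage(Rows * Cols), NumRows(Rows), NumCols(Cols) {}
  int rows() const { return NumRows; }
  int columns() const { return NumCols; }
  T &operator()(int Row, int Col) { return Storage[NumCols * Row + Col]; }
  const T &operator()(int Row, int Col) const {
    return Storage[NumCols * Row + Col];
  }
};

// Non-owning view: the viewed matrix must outlive it. Holding a pointer
// (rather than a reference) keeps the view copy-assignable.
template <typename T, typename MatrixTy>
class TransposedMatrixView
    : public MatrixInterface<T, TransposedMatrixView<T, MatrixTy>> {
  MatrixTy *Viewed;

public:
  explicit TransposedMatrixView(MatrixTy &M) : Viewed(&M) {}
  int rows() const { return Viewed->columns(); }
  int columns() const { return Viewed->rows(); }
  T &operator()(int Row, int Col) { return (*Viewed)(Col, Row); }
  const T &operator()(int Row, int Col) const { return (*Viewed)(Col, Row); }
};
```

The pointer member is the design choice in question: with `InnerTy &m` the compiler cannot generate an assignment operator, whereas the pointer-based view can be reassigned to look at another matrix.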

In D60000#2738598, @gchatelet wrote:

Some early comments on the linear algebra library.
Let me sync with @courbet first.

I'm still in early experimentation stages with this, this is not the code i'd post normally, i only posted just to avoid concurrent implementations.
I'm still getting really weird latencies, and just subtracting forwarding delays from sched model doesn't help.
This may mean that either the math is wrong, or there are many more forwarding delays that we don't model.

lebedev.ri mentioned this in D94395: [X86] AMD Zen 3 Scheduler Model. · May 14 2021, 10:26 AM

Looks like we end up recovering a forwarding delay of -0.04.
It should be ~-2.

Herald added subscribers: pengfei, hiraditya. · May 16 2021, 2:31 PM

Harbormaster completed remote builds in B104724: Diff 345728. · May 16 2021, 2:31 PM

Hm, one thing i should/could try here, is to ignore benchmarks that may have forwarding delays.
It won't help with unmodelled forwarding delays (we don't model IVec->FVec delays do we?),
but maybe it will be good-enough for some sequences..

Matt added a subscriber: Matt. · Jul 27 2021, 5:23 AM

Revision Contents

Path: Size

llvm/include/llvm/Support/LinearAlgebra.h: 60 lines
llvm/lib/MC/MCSchedule.cpp: 1 line
llvm/lib/Target/X86/X86ScheduleZnver3.td: 2 lines
llvm/test/tools/llvm-exegesis/X86/analysis-latency-instruction-chaining-domain-transfer.test: 52 lines
llvm/test/tools/llvm-exegesis/X86/analysis-latency-instruction-chaining.test: 57 lines
llvm/tools/llvm-exegesis/lib/ (six entries, file names not preserved in this table): 1 line, 8 lines, 1 line, 31 lines, 178 lines, 16 lines
llvm/unittests/Support/LinearAlgebraTest.cpp: 4 lines

Diff 345728

llvm/include/llvm/Support/LinearAlgebra.h

[... 21 lines not shown ...]

 namespace llvm {
 namespace linearalgebra {

 ///--------------------------------------------------------------------------///

 template <typename, typename> class TransposedMatrix;

 template <typename T, typename Derived> struct MatrixInterface {
[gchatelet] `static_assert` to make sure that `std::is_base_of<T, Derived>::value==true`
   using value_type = T;

   // MatrixInterface() = delete;
[gchatelet] fix
   MatrixInterface &operator=(MatrixInterface) = delete;
   MatrixInterface &operator=(const MatrixInterface &) = delete;

   int getNumRows() const { return ((const Derived *)this)->getNumRows(); }
[gchatelet] you may want to provide a function to factor the cast operation in.
   int getNumColumns() const { return ((const Derived *)this)->getNumColumns(); }
[gchatelet] These could be properties `rows()` `columns()`.

   T &operator()(int row, int col) {
     return ((Derived *)this)->operator()(row, col);
   }

   /// X^T
   TransposedMatrix<T, Derived> getTransposedMatrix() {
     return TransposedMatrix<T, Derived>(*((Derived *)this));
   }
 };

 ///--------------------------------------------------------------------------///

 template <typename T = double>
 class Matrix : public MatrixInterface<T, Matrix<T>> {
 public:
   Matrix(int num_rows, int num_cols) : num_rows(num_rows), num_cols(num_cols) {
     storage.resize(num_rows * num_cols);
   }

   // Disallow copying.
   // Matrix(const Matrix &) = delete;
   // Matrix(Matrix &&) = delete;
   // Matrix &operator=(Matrix) = delete;
   // Matrix &operator=(const Matrix &) = delete;
   // Matrix &operator=(Matrix &&) = delete;
[gchatelet] fix

   int getNumRows() const { return num_rows; };
   int getNumColumns() const { return num_cols; };
[gchatelet] ditto - properties

   T &operator()(int row, int col) {
     assert(row >= 0 && row < num_rows);
     assert(col >= 0 && col < num_cols);

     return storage[num_cols * row + col];
   }

 private:
   std::vector<T> storage;
   int num_rows;
   int num_cols;
 };

 template <typename T, typename InnerTy>
[gchatelet] `MatrixTy`
 class TransposedMatrix
[gchatelet] AFAIU from the implementation, this really is a `TransposedMatrixView`. I think it's important to highlight because `m` should outlive the `TransposedMatrix` object.
     : public MatrixInterface<T, TransposedMatrix<T, InnerTy>> {
   InnerTy &m;
[gchatelet] `matrix`
[gchatelet] Using a reference instead of a pointer prevents rebinding of this object (can't move, can't reassign). Is it a design decision?

 public:
   TransposedMatrix(InnerTy &m) : m(m) {}

   int getNumRows() const { return m.getNumColumns(); };
   int getNumColumns() const { return m.getNumRows(); };

   T &operator()(int row, int col) { return m(col, row); }
 };

 template <typename T, typename LHSTy, typename RHSTy>
 class AugmentedMatrix
[gchatelet] This is still a `View` AFAICT
     : public MatrixInterface<T, AugmentedMatrix<T, LHSTy, RHSTy>> {
   MatrixInterface<T, LHSTy> &a;
   MatrixInterface<T, RHSTy> &b;

 public:
   AugmentedMatrix(MatrixInterface<T, LHSTy> &a, MatrixInterface<T, RHSTy> &b)
       : a(a), b(b) {
     assert(a.getNumRows() == b.getNumRows());
[... 26 lines not shown ...]
     for (int col = 0; col != LHS.getNumColumns(); ++col) {
       m(row, col) = LHS(row, col);
     }
   }
   return m;
 }

 ///--------------------------------------------------------------------------///

 /// Get a square matrix with all elements being zero except the elements
 /// on the diagonal, which are ones.
[gchatelet] Wrong comment
 template <typename T, typename LHSTy>
 bool isSquareMatrix(const MatrixInterface<T, LHSTy> &LHS) {
   return LHS.getNumRows() == LHS.getNumColumns();
 }

 /// Are all the entries below the main diagonal are zero?
 template <typename T, typename LHSTy>
 bool isUpperTriangularMatrix(MatrixInterface<T, LHSTy> &LHS) {
[gchatelet] `const`
   for (int row = 0; row != LHS.getNumRows(); ++row) {
     for (int col = 0; col != LHS.getNumColumns(); ++col) {
       // Is this element on the diagonal, or to the right of diagonal?
       if (col >= row)
         continue;
       if (LHS(row, col) != 0)
         return false;
     }
   }
   return true;
 }

 ///--------------------------------------------------------------------------///

 /// Get a square matrix with all elements being zero except the elements
 /// on the diagonal, which are ones.
 template <typename T> Matrix<T> getIdentityMatrix(int Size) {
   Matrix<T> m(Size, Size);
   for (int diagEltIdx = 0; diagEltIdx != Size; ++diagEltIdx)
     m(diagEltIdx, diagEltIdx) = 1;
   return m;
 }

+/// Are all elements of this matrix, except the ones on the diagonal, a zeros?
+template <typename T, typename LHSTy>
+bool isDiagonalMatrix(MatrixInterface<T, LHSTy> &LHS) {
+  if (!isSquareMatrix(LHS))
+    return false;
+  for (int row = 0; row != LHS.getNumRows(); ++row) {
+    for (int col = 0; col != LHS.getNumColumns(); ++col) {
+      if (col == row)
+        continue;
+      if (LHS(row, col) != 0)
+        return false;
+    }
+  }
+  return true;
+}
+
 /// Are all elements of this matrix zeros, except the elements on the main
 /// diagonal, which are ones?
 template <typename T, typename LHSTy>
 bool isIdentityMatrix(MatrixInterface<T, LHSTy> &LHS) {
-  if (!isSquareMatrix(LHS))
+  if (!isDiagonalMatrix(LHS))
     return false;
   for (int row = 0; row != LHS.getNumRows(); ++row) {
     for (int col = 0; col != LHS.getNumColumns(); ++col) {
       if (LHS(row, col) != (col == row) ? 1 : 0)
[gchatelet] This is hard to read, do you mind introducing a variable for the expected value?
         return false;
     }
   }
   return true;
 }
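
A note on the comparison in `isIdentityMatrix`, with a standalone sketch: in C++ `!=` binds tighter than `?:`, so `x != cond ? 1 : 0` is parsed as `(x != cond) ? 1 : 0`. Because `bool` converts to 0 or 1, that still yields the same truth value as comparing against a named expected value, which may be why it is easy to misread. The two function names below are hypothetical and only mirror the shape of the expression for `double` elements.

```cpp
#include <cassert>

// Parsed as (Value != IsDiagonal) ? 1 : 0. IsDiagonal is converted to
// 0.0 or 1.0 for the comparison, and the resulting 1 or 0 becomes the bool.
bool mismatchAsWritten(double Value, bool IsDiagonal) {
  return Value != IsDiagonal ? 1 : 0;
}

// Naming the expected value first makes the intent explicit.
bool mismatchWithNamedExpectation(double Value, bool IsDiagonal) {
  const double Expected = IsDiagonal ? 1.0 : 0.0;
  return Value != Expected;
}
```

Both functions agree for every input, so introducing the variable is a readability change rather than a behavioural one in this sketch.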

///--------------------------------------------------------------------------///		///--------------------------------------------------------------------------///

/// On row \p row, which column is the first one to have a non-zero value?		/// On row \p row, which column is the first one to have a non-zero value?
/// Returns number of columns if no such element exists.		/// Returns number of columns if no such element exists.
template <typename T, typename LHSTy>		template <typename T, typename LHSTy>
int getLeadingCoeffientColumn(MatrixInterface<T, LHSTy> &LHS, int row) {		int getLeadingCoeffientColumn(MatrixInterface<T, LHSTy> &LHS, int row) {
		gchateletUnsubmitted Not Done Reply Inline Actions `const` gchatelet: `const`
int col = 0;		int col = 0;
for (; col != LHS.getNumColumns(); ++col) {		for (; col != LHS.getNumColumns(); ++col) {
if (LHS(row, col) != 0)		if (LHS(row, col) != 0)
break;		break;
}		}
return col;		return col;
}		}

/// Are all elements of the row \p row zeros?		/// Are all elements of the row \p row zeros?
template <typename T, typename LHSTy>		template <typename T, typename LHSTy>
int isZeroRow(MatrixInterface<T, LHSTy> &LHS, int row) {		int isZeroRow(MatrixInterface<T, LHSTy> &LHS, int row) {
		gchateletUnsubmitted Not Done Reply Inline Actions `const` here and below gchatelet: `const` here and below
return getLeadingCoeffientColumn(LHS, row) == LHS.getNumColumns();		return getLeadingCoeffientColumn(LHS, row) == LHS.getNumColumns();
}		}
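A rough sketch of the `const` suggestion for the two helpers above, written against a hypothetical `Row` alias (a plain `std::vector<double>`) rather than the patch's `MatrixInterface`. It also returns `bool` from the zero-row predicate, which is an assumption about the intent, since the patch declares it as `int`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Row = std::vector<double>; // Hypothetical stand-in for one matrix row.

// Index of the first non-zero element, or R.size() if the row is all zeros.
inline std::size_t leadingCoefficientColumn(const Row &R) {
  std::size_t Col = 0;
  while (Col != R.size() && R[Col] == 0.0)
    ++Col;
  return Col;
}

// True if every element of the row is zero.
inline bool isZeroRow(const Row &R) {
  return leadingCoefficientColumn(R) == R.size();
}
```

Taking the row by `const` reference lets both helpers be called on read-only data, which is all they need.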

template <typename T, typename LHSTy>		template <typename T, typename LHSTy>
bool isMatrixInRowEchelonForm(MatrixInterface<T, LHSTy> &LHS) {		bool isMatrixInRowEchelonForm(MatrixInterface<T, LHSTy> &LHS) {
assert(isUpperTriangularMatrix(LHS));		assert(isUpperTriangularMatrix(LHS));

bool seenZeroRow = false;		bool seenZeroRow = false;
(93 lines not shown)	for (int pivot = 0; pivot != LHS.getNumRows(); ++pivot) {
int pivotRow = findPivotingRow(LHS, column, pivot);		int pivotRow = findPivotingRow(LHS, column, pivot);

if (pivotRow != pivot) {		if (pivotRow != pivot) {
swapRows(LHS, pivotRow, pivot);		swapRows(LHS, pivotRow, pivot);
pivotRow = pivot;		pivotRow = pivot;
}		}

T &pivotElement = LHS(pivotRow, column);		T &pivotElement = LHS(pivotRow, column);
if (pivotElement == 0)		assert(pivotElement != 0);
continue;

divideRow(LHS, pivotRow, pivotElement);		divideRow(LHS, pivotRow, pivotElement);
pivotElement = 1.0; // Account for floating point rounding issues.		pivotElement = 1.0; // Account for floating point rounding issues.

for (int row = 0; row != LHS.getNumRows(); ++row) {		for (int row = 0; row != LHS.getNumRows(); ++row) {
if (row == pivotRow)		if (row == pivotRow)
continue;		continue;

T &currElement = LHS(row, column);		T &currElement = LHS(row, column);
if (currElement == 0)		if (currElement == 0)
continue;		continue;

subtractRowMultiple(LHS, row, pivotRow, currElement);		subtractRowMultiple(LHS, row, pivotRow, currElement);
currElement = 0; // Account for floating point rounding issues.		currElement = 0; // Account for floating point rounding issues.
}		}
}		}
assert(isMatrixInReducedRowEchelonForm(LHS));		assert(isMatrixInReducedRowEchelonForm(LHS));
}		}

/// X^-1		/// X^-1
template <typename T, typename LHSTy>		template <typename T, typename LHSTy>
Matrix<T> getInverseMatrix(MatrixInterface<T, LHSTy> &LHS) {		Matrix<T> getInverseMatrix(MatrixInterface<T, LHSTy> &LHS) {
		gchatelet: `const` here and below
assert(isSquareMatrix(LHS));		assert(isSquareMatrix(LHS));
Matrix<T> A = cloneMatrix<T>(LHS);		Matrix<T> A = cloneMatrix<T>(LHS);
Matrix<T> B = getIdentityMatrix<T>(LHS.getNumRows());		Matrix<T> B = getIdentityMatrix<T>(LHS.getNumRows());
auto AB = getAugmentedMatrix(A, B);		auto AB = getAugmentedMatrix(A, B);
performGaussJordanElimination(AB);		performGaussJordanElimination(AB);
assert(isIdentityMatrix(A));		assert(isIdentityMatrix(A));
return B;		return B;
}		}
(37 lines not shown)	Matrix<T> getMomentMatrix(MatrixInterface<T, LHSTy> &LHS,
auto LHSTransposed = LHS.getTransposedMatrix();		auto LHSTransposed = LHS.getTransposedMatrix();
return LHSTransposed * RHS;		return LHSTransposed * RHS;
}		}

///--------------------------------------------------------------------------///		///--------------------------------------------------------------------------///

/// (X^T * X)^-1 * X^T * y		/// (X^T * X)^-1 * X^T * y
template <typename T, typename LHSTy, typename RHSTy>		template <typename T, typename LHSTy, typename RHSTy>
Matrix<T> getOrdinaryLeastSquaresEstimation(MatrixInterface<T, LHSTy> &LHS,		Matrix<T> getOrdinaryLeastSquaresEstimator(MatrixInterface<T, LHSTy> &XMat,
MatrixInterface<T, RHSTy> &RHS) {		MatrixInterface<T, RHSTy> &YVec) {
assert(LHS.getNumRows() >= LHS.getNumColumns());		assert(XMat.getNumRows() >= XMat.getNumColumns());
assert(LHS.getNumRows() == RHS.getNumRows());		assert(XMat.getNumRows() == YVec.getNumRows());
assert(RHS.getNumColumns() == 1);		assert(YVec.getNumColumns() == 1);

		auto XMatTransposed = XMat.getTransposedMatrix();
		Matrix<T> XMatNormal = XMatTransposed * XMat;
		Matrix<T> XMatNormalInverse = getInverseMatrix<T>(XMatNormal);
		Matrix<T> XYMoment = XMatTransposed * YVec;
		return XMatNormalInverse * XYMoment;
		}

auto LHSNormal = getNormalMatrix<T>(LHS);		/// (X^T * W * X)^-1 * X^T * W * y
auto LHSNormalInverse = getInverseMatrix<T>(LHSNormal);		template <typename T, typename LHSTy, typename RHSTy>
auto LHSRHSMoment = getMomentMatrix<T>(LHS, RHS);		Matrix<T> getWeightedLeastSquaresEstimator(MatrixInterface<T, LHSTy> &XMat,
return LHSNormalInverse * LHSRHSMoment;		MatrixInterface<T, RHSTy> &YVec,
		MatrixInterface<T, RHSTy> &WMat) {
		assert(XMat.getNumRows() >= XMat.getNumColumns());
		assert(isDiagonalMatrix(WMat));
		assert(XMat.getNumRows() == WMat.getNumRows());
		assert(XMat.getNumRows() == YVec.getNumRows());
		assert(YVec.getNumColumns() == 1);

		auto XMatTransposed = XMat.getTransposedMatrix();
		Matrix<T> XMatTransposedWeighted = XMatTransposed * WMat;
		Matrix<T> XMatTransposedWeightedNormal = XMatTransposedWeighted * XMat;
		Matrix<T> XMatTransposedWeightedNormalInverse =
		getInverseMatrix<T>(XMatTransposedWeightedNormal);
		Matrix<T> XYMomentWeighted = XMatTransposedWeighted * YVec;
		return XMatTransposedWeightedNormalInverse * XYMomentWeighted;
}		}
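To make the formula `(X^T * W * X)^-1 * X^T * W * y` concrete, here is a minimal standalone sketch. It assumes a diagonal weight matrix passed as a vector `w`, and uses hypothetical `Mat`/`Vec` aliases and a small Gauss-Jordan `solve` helper instead of the patch's `Matrix<T>` and `getInverseMatrix`. It solves the normal equations directly rather than forming the inverse.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>; // Row-major; hypothetical stand-in type.

// Solves A * x = b by Gauss-Jordan elimination with partial pivoting.
// A must be square and non-singular.
inline Vec solve(Mat A, Vec b) {
  const std::size_t N = A.size();
  for (std::size_t Col = 0; Col != N; ++Col) {
    std::size_t Pivot = Col;
    for (std::size_t Row = Col + 1; Row != N; ++Row)
      if (std::fabs(A[Row][Col]) > std::fabs(A[Pivot][Col]))
        Pivot = Row;
    assert(A[Pivot][Col] != 0.0 && "singular normal matrix");
    std::swap(A[Col], A[Pivot]);
    std::swap(b[Col], b[Pivot]);
    const double Scale = A[Col][Col];
    for (double &V : A[Col])
      V /= Scale;
    b[Col] /= Scale;
    for (std::size_t Row = 0; Row != N; ++Row) {
      if (Row == Col)
        continue;
      const double Factor = A[Row][Col];
      for (std::size_t K = 0; K != N; ++K)
        A[Row][K] -= Factor * A[Col][K];
      b[Row] -= Factor * b[Col];
    }
  }
  return b;
}

// Weighted least squares with W = diag(w): beta = (X^T W X)^-1 X^T W y.
inline Vec weightedLeastSquares(const Mat &X, const Vec &y, const Vec &w) {
  const std::size_t Rows = X.size(), Cols = X.front().size();
  Mat Normal(Cols, Vec(Cols, 0.0)); // Accumulates X^T W X.
  Vec Moment(Cols, 0.0);            // Accumulates X^T W y.
  for (std::size_t R = 0; R != Rows; ++R)
    for (std::size_t I = 0; I != Cols; ++I) {
      Moment[I] += X[R][I] * w[R] * y[R];
      for (std::size_t J = 0; J != Cols; ++J)
        Normal[I][J] += X[R][I] * w[R] * X[R][J];
    }
  return solve(std::move(Normal), std::move(Moment));
}
```

With all weights equal to 1 this reduces to ordinary least squares; on the data from the `OLSOverdefined` unit test it should reproduce the coefficients 7/2 and 7/5 up to rounding.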

///--------------------------------------------------------------------------///		///--------------------------------------------------------------------------///

} // namespace linearalgebra		} // namespace linearalgebra
} // namespace llvm		} // namespace llvm

#endif		#endif

llvm/lib/MC/MCSchedule.cpp

(154 lines not shown)	MCSchedModel::getForwardingDelayCycles(ArrayRef<MCReadAdvanceEntry> Entries,
unsigned WriteResourceID) {		unsigned WriteResourceID) {
if (Entries.empty())		if (Entries.empty())
return 0;		return 0;

int DelayCycles = 0;		int DelayCycles = 0;
for (const MCReadAdvanceEntry &E : Entries) {		for (const MCReadAdvanceEntry &E : Entries) {
if (E.WriteResourceID != WriteResourceID)		if (E.WriteResourceID != WriteResourceID)
continue;		continue;
		llvm::errs() << "&E = " << &E << ", cycles = " << E.Cycles << "\n";
DelayCycles = std::min(DelayCycles, E.Cycles);		DelayCycles = std::min(DelayCycles, E.Cycles);
}		}

return std::abs(DelayCycles);		return std::abs(DelayCycles);
}		}
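The logic of the function above can be sketched in isolation as follows. `ReadAdvanceEntry` is a hypothetical, simplified mirror of `MCReadAdvanceEntry` with only the two fields the loop uses, and `forwardingDelayCycles` is an illustrative name. The sketch assumes, as the Znver3 comment in this diff describes, that a negative read-advance value models extra cycles of delay.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical, simplified mirror of a read-advance table entry.
struct ReadAdvanceEntry {
  unsigned WriteResourceID;
  int Cycles; // Negative values model extra delay before the read is ready.
};

// Returns the largest delay (as a positive cycle count) among the entries
// that match WriteResourceID, or 0 if none of them is negative.
inline unsigned
forwardingDelayCycles(const std::vector<ReadAdvanceEntry> &Entries,
                      unsigned WriteResourceID = 0) {
  int DelayCycles = 0;
  for (const ReadAdvanceEntry &E : Entries) {
    if (E.WriteResourceID != WriteResourceID)
      continue;
    DelayCycles = std::min(DelayCycles, E.Cycles);
  }
  return static_cast<unsigned>(std::abs(DelayCycles));
}
```

Positive entries never contribute, because the running minimum starts at 0; only the most negative matching entry determines the result.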

llvm/lib/Target/X86/X86ScheduleZnver3.td

	(476 lines not shown)

	def : ReadAdvance<ReadAfterVecLd, Znver3Model.VecLoadLatency>;			def : ReadAdvance<ReadAfterVecLd, Znver3Model.VecLoadLatency>;
	def : ReadAdvance<ReadAfterVecXLd, Znver3Model.VecLoadLatency>;			def : ReadAdvance<ReadAfterVecXLd, Znver3Model.VecLoadLatency>;
	def : ReadAdvance<ReadAfterVecYLd, Znver3Model.VecLoadLatency>;			def : ReadAdvance<ReadAfterVecYLd, Znver3Model.VecLoadLatency>;

	// AMD SOG 19h, 2.11 Floating-Point Unit			// AMD SOG 19h, 2.11 Floating-Point Unit
	// There is 1 cycle of added latency for a result to cross			// There is 1 cycle of added latency for a result to cross
	// from F to I or I to F domain.			// from F to I or I to F domain.
	def : ReadAdvance<ReadInt2Fpu, -1>;			def : ReadAdvance<ReadInt2Fpu, -42>;

	// Instructions with both a load and a store folded are modeled as a folded			// Instructions with both a load and a store folded are modeled as a folded
	// load + WriteRMW.			// load + WriteRMW.
	defm : Zn3WriteResInt<WriteRMW, [Zn3AGU012, Zn3Store], Znver3Model.StoreLatency, [1, 1], 0>;			defm : Zn3WriteResInt<WriteRMW, [Zn3AGU012, Zn3Store], Znver3Model.StoreLatency, [1, 1], 0>;

	// Loads, stores, and moves, not folded with other operations.			// Loads, stores, and moves, not folded with other operations.
	defm : Zn3WriteResInt<WriteLoad, [Zn3AGU012, Zn3Load], !add(Znver3Model.LoadLatency, 1), [1, 1], 1>;			defm : Zn3WriteResInt<WriteLoad, [Zn3AGU012, Zn3Load], !add(Znver3Model.LoadLatency, 1), [1, 1], 1>;

	(1,169 lines not shown)

llvm/test/tools/llvm-exegesis/X86/analysis-latency-instruction-chaining-domain-transfer.test

This file was added.

				# RUN: llvm-exegesis -mode=analysis -benchmarks-file=%s -analysis-clusters-output-file=- -analysis-clustering-epsilon=0.5 -analysis-inconsistency-epsilon=0.5 -analysis-numpoints=1 | FileCheck -check-prefixes=CHECK %s

				# CHECK: {{^}}cluster_id,opcode_name,config,sched_class,latency{{$}}

				# CHECK-NEXT: {{^}}0,
				# CHECK-SAME: ,10.08{{$}}

				# CHECK: {{^}}1,
				# CHECK-SAME: ,11.07{{$}}

				# PINSRBrr has latency of 2 cycles. (the value from scheduling profile!)
				# But int to fpu units data transfer causes additional latency of 10 cycles.
				# Thus the actual latency of VPEXTRBrr is 10..11.

				---
				mode: latency
				key:
				instructions:
				- 'VPEXTRBrr R15D XMM3 i_0x1'
				- 'PINSRBrr XMM3 XMM3 R15D i_0x1'
				config: ''
				register_initial_values:
				- 'XMM3=0x0'
				cpu_name: bdver2
				llvm_triple: x86_64-unknown-linux-gnu
				num_repetitions: 10000
				measurements:
				- { key: latency, value: 0.0000, per_snippet_value: 22.0802 }
				error: ''
				info: Repeating two instructions
				assembled_snippet: 41574883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F1C244883C410C4C37914DF0166410F3A20DF01C4C37914DF0166410F3A20DF01C4C37914DF0166410F3A20DF01C4C37914DF0166410F3A20DF01C4C37914DF0166410F3A20DF01C4C37914DF0166410F3A20DF01C4C37914DF0166410F3A20DF01C4C37914DF0166410F3A20DF01415FC3
				...
				---
				mode: latency
				key:
				instructions:
				- 'PEXTRBrr ESI XMM7 i_0x1'
				- 'VCVTSI642SSrr XMM7 XMM12 RSI'
				config: ''
				register_initial_values:
				- 'XMM7=0x0'
				- 'XMM12=0x0'
				- 'RSI=0x0'
				cpu_name: bdver2
				llvm_triple: x86_64-unknown-linux-gnu
				num_repetitions: 10000
				measurements:
				- { key: latency, value: 12.533, per_snippet_value: 25.066 }
				error: ''
				info: Repeating two instructions
				assembled_snippet: 4883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C5FA6F3C244883C4104883EC10C7042400000000C744240400000000C744240800000000C744240C00000000C57A6F24244883C41048BE0000000000000000660F3A14FE01C4E19A2AFE660F3A14FE01C4E19A2AFE660F3A14FE01C4E19A2AFE660F3A14FE01C4E19A2AFE660F3A14FE01C4E19A2AFE660F3A14FE01C4E19A2AFE660F3A14FE01C4E19A2AFE660F3A14FE01C4E19A2AFEC3
				...

llvm/test/tools/llvm-exegesis/X86/analysis-latency-instruction-chaining.test

This file was added.

				# RUN: llvm-exegesis -mode=analysis -benchmarks-file=%s -analysis-clusters-output-file=- -analysis-clustering-epsilon=0.5 -analysis-inconsistency-epsilon=0.5 -analysis-numpoints=1 | FileCheck -check-prefixes=CHECK %s

				# CHECK: {{^}}cluster_id,opcode_name,config,sched_class,latency{{$}}

				# CHECK-NEXT: {{^}}0,
				# CHECK-SAME: ,1.00{{$}}
				# CHECK-NEXT: {{^}}0,
				# CHECK-SAME: ,1.00{{$}}

				# Instructions were executed serially, meaning that the next instruction
				# ONLY starts executing when the current instruction finishes.
				# Thus, the real latency of the first instruction is the per_snippet_value minus
				# the sum of latencies of all the other instructions in the snippet.

				# RCR8rCL has latency of 11. (the value from scheduling profile!)
				# Latency of whole snippet is 12 or 23. (not measured, hand-written.)
				# Thus, latency of BT32rr is 12-11 = 1, or 23-11-11 = 1

				---
				mode: latency
				key:
				instructions:
				- 'BT32rr R11D R11D'
				- 'RCR8rCL R11B R11B'
				config: ''
				register_initial_values:
				- 'R11D=0x0'
				- 'R11B=0x0'
				- 'CL=0x0'
				cpu_name: bdver2
				llvm_triple: x86_64-unknown-linux-gnu
				num_repetitions: 10000
				measurements:
				- { key: latency, value: 50.0000, per_snippet_value: 100.0000 }
				error: ''
				info: Repeating two instructions
				assembled_snippet: 41BB0000000041B300B100450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DBC3
				...
				---
				mode: latency
				key:
				instructions:
				- 'RCR8rCL R11B R11B'
				config: ''
				register_initial_values:
				- 'R11D=0x0'
				- 'R11B=0x0'
				- 'CL=0x0'
				cpu_name: bdver2
				llvm_triple: x86_64-unknown-linux-gnu
				num_repetitions: 10000
				measurements:
				- { key: latency, value: 25.0000, per_snippet_value: 25.0000 }
				error: ''
				info: Repeating two instructions
				assembled_snippet: 41BB0000000041B300B100450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DBC3
				...
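The arithmetic described in this test's comments (snippet latency minus the latencies of the other instructions) can be written down as a tiny helper. This is only an illustration of that subtraction under the serial-execution assumption stated in the comments; `residualLatency` is a hypothetical name, and the patch itself estimates latencies with a weighted least squares fit rather than direct subtraction.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Subtracts the (assumed known) latencies of the other instructions in a
// serially executed snippet from the measured per-snippet latency.
inline double residualLatency(double PerSnippetLatency,
                              const std::vector<double> &OtherLatencies) {
  return PerSnippetLatency -
         std::accumulate(OtherLatencies.begin(), OtherLatencies.end(), 0.0);
}
```

With the hand-written numbers from the comments, `residualLatency(12.0, {11.0})` and `residualLatency(23.0, {11.0, 11.0})` both give 1, matching the stated latency of `BT32rr`.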

llvm/tools/llvm-exegesis/lib/Analysis.h

	(31 lines not shown)

	namespace llvm {			namespace llvm {
	namespace exegesis {			namespace exegesis {

	// A helper class to analyze benchmark results for a target.			// A helper class to analyze benchmark results for a target.
	class Analysis {			class Analysis {
	public:			public:
	Analysis(const Target &Target, std::unique_ptr<MCInstrInfo> InstrInfo,			Analysis(const Target &Target, std::unique_ptr<MCInstrInfo> InstrInfo,
				std::unique_ptr<MCSubtargetInfo> SubtargetInfo,
	const InstructionBenchmarkClustering &Clustering,			const InstructionBenchmarkClustering &Clustering,
	double AnalysisInconsistencyEpsilon,			double AnalysisInconsistencyEpsilon,
	bool AnalysisDisplayUnstableOpcodes,			bool AnalysisDisplayUnstableOpcodes,
	const std::string &ForceCpuName = "");			const std::string &ForceCpuName = "");

	// Prints a csv of instructions for each cluster.			// Prints a csv of instructions for each cluster.
	struct PrintClusters {};			struct PrintClusters {};
	// Find potential errors in the scheduling information given measurements.			// Find potential errors in the scheduling information given measurements.
	(83 lines not shown)

llvm/tools/llvm-exegesis/lib/Analysis.cpp

(146 lines not shown)	#endif
for (const auto &Measurement : Point.Measurements) {		for (const auto &Measurement : Point.Measurements) {
OS << kCsvSep;		OS << kCsvSep;
writeMeasurementValue<kEscapeCsv>(OS, Measurement.PerInstructionValue);		writeMeasurementValue<kEscapeCsv>(OS, Measurement.PerInstructionValue);
}		}
OS << "\n";		OS << "\n";
}		}

Analysis::Analysis(const Target &Target, std::unique_ptr<MCInstrInfo> InstrInfo,		Analysis::Analysis(const Target &Target, std::unique_ptr<MCInstrInfo> InstrInfo,
		std::unique_ptr<MCSubtargetInfo> SubtargetInfo,
const InstructionBenchmarkClustering &Clustering,		const InstructionBenchmarkClustering &Clustering,
double AnalysisInconsistencyEpsilon,		double AnalysisInconsistencyEpsilon,
bool AnalysisDisplayUnstableOpcodes,		bool AnalysisDisplayUnstableOpcodes,
const std::string &ForceCpuName)		const std::string &ForceCpuName)
: Clustering_(Clustering), InstrInfo_(std::move(InstrInfo)),		: Clustering_(Clustering), SubtargetInfo_(std::move(SubtargetInfo)),
		InstrInfo_(std::move(InstrInfo)),
AnalysisInconsistencyEpsilonSquared_(AnalysisInconsistencyEpsilon *		AnalysisInconsistencyEpsilonSquared_(AnalysisInconsistencyEpsilon *
AnalysisInconsistencyEpsilon),		AnalysisInconsistencyEpsilon),
AnalysisDisplayUnstableOpcodes_(AnalysisDisplayUnstableOpcodes) {		AnalysisDisplayUnstableOpcodes_(AnalysisDisplayUnstableOpcodes) {
if (Clustering.getPoints().empty())		if (Clustering.getPoints().empty())
return;		return;

const InstructionBenchmark &FirstPoint = Clustering.getPoints().front();		const InstructionBenchmark &FirstPoint = Clustering.getPoints().front();
const std::string CpuName =		const std::string CpuName =
ForceCpuName.empty() ? FirstPoint.CpuName : ForceCpuName;		ForceCpuName.empty() ? FirstPoint.CpuName : ForceCpuName;
RegInfo_.reset(Target.createMCRegInfo(FirstPoint.LLVMTriple));		RegInfo_.reset(Target.createMCRegInfo(FirstPoint.LLVMTriple));
MCTargetOptions MCOptions;		MCTargetOptions MCOptions;
AsmInfo_.reset(		AsmInfo_.reset(
Target.createMCAsmInfo(*RegInfo_, FirstPoint.LLVMTriple, MCOptions));		Target.createMCAsmInfo(*RegInfo_, FirstPoint.LLVMTriple, MCOptions));
SubtargetInfo_.reset(
Target.createMCSubtargetInfo(FirstPoint.LLVMTriple, CpuName, ""));
InstPrinter_.reset(Target.createMCInstPrinter(		InstPrinter_.reset(Target.createMCInstPrinter(
Triple(FirstPoint.LLVMTriple), 0 /*default variant*/, *AsmInfo_,		Triple(FirstPoint.LLVMTriple), 0 /*default variant*/, *AsmInfo_,
*InstrInfo_, *RegInfo_));		*InstrInfo_, *RegInfo_));

Context_ = std::make_unique<MCContext>(		Context_ = std::make_unique<MCContext>(
Triple(FirstPoint.LLVMTriple), AsmInfo_.get(), RegInfo_.get(),		Triple(FirstPoint.LLVMTriple), AsmInfo_.get(), RegInfo_.get(),
&ObjectFileInfo_, SubtargetInfo_.get());		&ObjectFileInfo_, SubtargetInfo_.get());
Disasm_.reset(Target.createMCDisassembler(*SubtargetInfo_, *Context_));		Disasm_.reset(Target.createMCDisassembler(*SubtargetInfo_, *Context_));
(102 lines not shown)	case InstructionBenchmark::Latency:
break;		break;
case InstructionBenchmark::Uops:		case InstructionBenchmark::Uops:
case InstructionBenchmark::InverseThroughput:		case InstructionBenchmark::InverseThroughput:
writeParallelSnippetHtml(OS, Point.Key.Instructions, *InstrInfo_);		writeParallelSnippetHtml(OS, Point.Key.Instructions, *InstrInfo_);
break;		break;
default:		default:
llvm_unreachable("invalid mode");		llvm_unreachable("invalid mode");
}		}
		if (Point.Info == "WLS fixpoint" || Point.Info == "WLS reconstruction")
		OS << " <small><i>(" << Point.Info << ")</i></small>";
OS << "</span> <span class=\"mono\">";		OS << "</span> <span class=\"mono\">";
writeEscaped<kEscapeHtml>(OS, Point.Key.Config);		writeEscaped<kEscapeHtml>(OS, Point.Key.Config);
OS << "</span></li>";		OS << "</span></li>";
}		}

void Analysis::printSchedClassClustersHtml(		void Analysis::printSchedClassClustersHtml(
const std::vector<SchedClassCluster> &Clusters,		const std::vector<SchedClassCluster> &Clusters,
const ResolvedSchedClass &RSC, raw_ostream &OS) const {		const ResolvedSchedClass &RSC, raw_ostream &OS) const {
(305 lines not shown)

llvm/tools/llvm-exegesis/lib/CMakeLists.txt

(49 lines not shown)	add_llvm_library(LLVMExegesis
Clustering.cpp		Clustering.cpp
CodeTemplate.cpp		CodeTemplate.cpp
Error.cpp		Error.cpp
LatencyBenchmarkRunner.cpp		LatencyBenchmarkRunner.cpp
LlvmState.cpp		LlvmState.cpp
MCInstrDescView.cpp		MCInstrDescView.cpp
ParallelSnippetGenerator.cpp		ParallelSnippetGenerator.cpp
PerfHelper.cpp		PerfHelper.cpp
		PostProcessing.cpp
RegisterAliasing.cpp		RegisterAliasing.cpp
RegisterValue.cpp		RegisterValue.cpp
SchedClassResolution.cpp		SchedClassResolution.cpp
SerialSnippetGenerator.cpp		SerialSnippetGenerator.cpp
SnippetFile.cpp		SnippetFile.cpp
SnippetGenerator.cpp		SnippetGenerator.cpp
SnippetRepetitor.cpp		SnippetRepetitor.cpp
Target.cpp		Target.cpp
UopsBenchmarkRunner.cpp		UopsBenchmarkRunner.cpp

LINK_LIBS ${libs}		LINK_LIBS ${libs}

DEPENDS		DEPENDS
intrinsics_gen		intrinsics_gen
)		)

llvm/tools/llvm-exegesis/lib/PostProcessing.h

This file was added.

				//===-- PostProcessing.h -----------------------------------------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				///
				/// \file
				/// Post-processing for the benchmark points.
				///
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_TOOLS_LLVM_EXEGESIS_POSTPROCESSING_H
				#define LLVM_TOOLS_LLVM_EXEGESIS_POSTPROCESSING_H

				#include "BenchmarkResult.h"
				#include <vector>

				namespace llvm {
				namespace exegesis {

				void PostProcessChainedLatencyBenchmarkPoints(
				std::vector<InstructionBenchmark> &Points,
				const llvm::MCInstrInfo &InstrInfo,
				const llvm::MCSubtargetInfo &SubtargetInfo);

				} // namespace exegesis
				} // namespace llvm

				#endif // LLVM_TOOLS_LLVM_EXEGESIS_POSTPROCESSING_H

llvm/tools/llvm-exegesis/lib/PostProcessing.cpp

This file was added.

				//===-- PostProcessing.cpp ---------------------------------------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "PostProcessing.h"
				#include "Clustering.h"
				#include "SchedClassResolution.h"
				#include "llvm/ADT/STLExtras.h"
				#include "llvm/ADT/SmallSet.h"
				#include "llvm/Support/LinearAlgebra.h"
				#include <utility>

				namespace llvm {
				namespace exegesis {

				static std::pair<double, bool>
				getAdjustedLatency(const InstructionBenchmark *RelevantPoint,
				const llvm::MCInstrInfo &InstrInfo,
				const llvm::MCSubtargetInfo &SubtargetInfo) {
				assert(RelevantPoint->Measurements.size() == 1);
				assert(RelevantPoint->Measurements[0].Key == "latency");
				double Latency = RelevantPoint->Measurements[0].PerSnippetValue;
				bool HaveForwardingDelays = false;

				if (RelevantPoint->Key.Instructions.size() == 1)
				return {Latency, HaveForwardingDelays};

				const MCSchedModel &SM = SubtargetInfo.getSchedModel();
				for (const MCInst &Inst : RelevantPoint->Key.Instructions) {
				const MCInstrDesc &MCDesc = InstrInfo.get(Inst.getOpcode());

				// Obtain the scheduling class information from the instruction.
				unsigned SchedClassID = MCDesc.getSchedClass();
				unsigned CPUID = SM.getProcessorID();

				// Try to solve variant scheduling classes.
				while (SchedClassID && SM.getSchedClassDesc(SchedClassID)->isVariant())
				SchedClassID = SubtargetInfo.resolveVariantSchedClass(SchedClassID, &Inst,
				&InstrInfo, CPUID);

				const MCSchedClassDesc &SCDesc = *SM.getSchedClassDesc(SchedClassID);
				unsigned ForwardingDelayCycles = MCSchedModel::getForwardingDelayCycles(
				SubtargetInfo.getReadAdvanceEntries(SCDesc));
				HaveForwardingDelays |= ForwardingDelayCycles != 0;
				}

				return {Latency, HaveForwardingDelays};
				}

				void PostProcessChainedLatencyBenchmarkPoints(
				std::vector<InstructionBenchmark> &Points,
				const llvm::MCInstrInfo &InstrInfo,
				const llvm::MCSubtargetInfo &SubtargetInfo) {
				auto IsLatencyPoint = [](const InstructionBenchmark &Point) {
				return Point.Mode == InstructionBenchmark::ModeE::Latency &&
				Point.Error.empty();
				};
				auto IsChainedLatencyPoint =
				[IsLatencyPoint](const InstructionBenchmark &Point) {
				return IsLatencyPoint(Point) && Point.Key.Instructions.size() >= 2;
				};

				unsigned NumOpcodes = InstrInfo.getNumOpcodes();

				// Which opcodes were ever chained in all of the benchmark points?
				// Only record the ones for which we succeeded in measuring latency.
				std::vector<int> OpcodeToIndex(NumOpcodes, /*Index=*/-1);
				std::vector<MCInst> IndexToOpcode;
				IndexToOpcode.reserve(NumOpcodes);
				for (const InstructionBenchmark &Point :
				make_filter_range(Points, IsChainedLatencyPoint)) {
				for (const MCInst &Instruction : Point.Key.Instructions) {
				const unsigned Opcode = Instruction.getOpcode();
				assert(Opcode < NumOpcodes && "NumOpcodes is incorrect (too small)");
				if (OpcodeToIndex[Opcode] != -1) // Already seen chained?
				continue;
				OpcodeToIndex[Opcode] = IndexToOpcode.size();
				IndexToOpcode.emplace_back(Instruction);
				}
				}

				if (IndexToOpcode.empty())
				return; // Lucky us.

				// Remember all the points that contained any opcode that was ever chained.
				// We can not do this in the previous loop, because if Opc0 was chained with
				// Opc1, we'll miss all standalone points for Opc1 before seeing the chaining.
				// We store indexes into Points to avoid iterator invalidation.
				SmallVector<const InstructionBenchmark *, 64> RelevantPoints;
				for (const InstructionBenchmark &Point :
				make_filter_range(Points, IsLatencyPoint)) {
				if (any_of(Point.Key.Instructions,
				[OpcodeToIndex](const MCInst &Instruction) {
				return OpcodeToIndex[Instruction.getOpcode()] != -1;
				}))
				RelevantPoints.emplace_back(&Point);
				}

				using namespace linearalgebra;

				errs() << "rows = " << RelevantPoints.size() << "\n";
				errs() << "cols = " << IndexToOpcode.size() << "\n";
				Matrix<double> OpcodeChaining(RelevantPoints.size(), IndexToOpcode.size());
				Matrix<double> ForwardingDelayPresence(RelevantPoints.size(), 1);
				Matrix<double> SnippetLatency(RelevantPoints.size(), 1);
				Matrix<double> Weights = getIdentityMatrix<double>(RelevantPoints.size());

				std::vector<char> FixpointOpcodes(NumOpcodes, /*IsFixpoint=*/false);

				for (auto I : enumerate(RelevantPoints)) {
				int row = I.index();
				const InstructionBenchmark *RelevantPoint = I.value();

				// FIXME: this
				std::tie(SnippetLatency(row, 0), ForwardingDelayPresence(row, 0)) =
				getAdjustedLatency(RelevantPoint, InstrInfo, SubtargetInfo);

				if (RelevantPoint->Key.Instructions.size() == 1) {
				// Give a(n arbitrary) bonus to the points that directly measured
				// one single specific instruction. We believe these measurements
				// to be precise, so the fit should basically not change them.
				Weights(row, row) = 1e+3;
				FixpointOpcodes[RelevantPoint->Key.Instructions.front().getOpcode()] =
				true;
				}

				for (const MCInst &Instrn : RelevantPoint->Key.Instructions) {
				double &Entry = OpcodeChaining(row, OpcodeToIndex[Instrn.getOpcode()]);
				assert(Entry == 0);
				Entry = 1;
				}
				}
				RelevantPoints.clear();
				OpcodeToIndex.clear();

				auto Zz = getAugmentedMatrix(OpcodeChaining, ForwardingDelayPresence);
				auto EstimatedInstructionLatencies =
				getWeightedLeastSquaresEstimator<double>(Zz, SnippetLatency, Weights);
				assert(EstimatedInstructionLatencies.getNumColumns() == 1);
				assert((size_t)EstimatedInstructionLatencies.getNumRows() >=
				IndexToOpcode.size());

				// Points.erase(
				// std::remove_if(Points.begin(), Points.end(), IsChainedLatencyPoint),
				// Points.end());
				Points.clear();

				Points.reserve(Points.size() + IndexToOpcode.size());
				for (auto I : enumerate(IndexToOpcode)) {
				const MCInst &Instruction = I.value();
				const double EstimatedInstructionLatency =
				EstimatedInstructionLatencies(I.index(), 0);

				Points.emplace_back();
				InstructionBenchmark &NewPoint = Points.back();

				NewPoint.Key.Instructions.emplace_back(Instruction);
				NewPoint.Mode = InstructionBenchmark::Latency;
				NewPoint.CpuName = Points.front().CpuName;
				NewPoint.LLVMTriple = Points.front().LLVMTriple;
				NewPoint.Measurements.emplace_back(
				BenchmarkMeasure::Create("latency", EstimatedInstructionLatency));
				NewPoint.Info = FixpointOpcodes[Instruction.getOpcode()]
				? "WLS fixpoint"
				: "WLS reconstruction";
				}
				for (int I = IndexToOpcode.size();
				I < EstimatedInstructionLatencies.getNumRows(); ++I)
				errs() << "Reconstructed forwarding delay = "
				<< EstimatedInstructionLatencies(I, 0) << "\n";
				}

				} // namespace exegesis
				} // namespace llvm
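The bookkeeping in `PostProcessChainedLatencyBenchmarkPoints` (assign a column to every opcode that was ever chained, then emit one 0/1 row per relevant point) can be sketched on its own as below. This is a simplified illustration under stated assumptions: snippets are reduced to plain lists of opcode numbers, `Snippet`, `DesignMatrix` and `buildDesignMatrix` are hypothetical names, and the extra forwarding-delay column and the weights from the patch are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Each benchmark point is reduced to the list of opcodes in its snippet.
using Snippet = std::vector<unsigned>;

struct DesignMatrix {
  std::vector<unsigned> ColumnOpcodes;   // Column index -> opcode.
  std::vector<std::vector<double>> Rows; // One 0/1 row per relevant snippet.
};

inline DesignMatrix buildDesignMatrix(const std::vector<Snippet> &Snippets) {
  DesignMatrix D;
  std::map<unsigned, std::size_t> OpcodeToColumn;
  // First pass: only opcodes seen in chained snippets (two or more
  // instructions) get a column, in order of first appearance.
  for (const Snippet &S : Snippets) {
    if (S.size() < 2)
      continue;
    for (unsigned Opcode : S)
      if (OpcodeToColumn.emplace(Opcode, D.ColumnOpcodes.size()).second)
        D.ColumnOpcodes.push_back(Opcode);
  }
  // Second pass: every snippet that mentions a chained opcode becomes a row,
  // including standalone measurements of that opcode.
  for (const Snippet &S : Snippets) {
    std::vector<double> Row(D.ColumnOpcodes.size(), 0.0);
    bool Relevant = false;
    for (unsigned Opcode : S) {
      auto It = OpcodeToColumn.find(Opcode);
      if (It == OpcodeToColumn.end())
        continue;
      Row[It->second] = 1.0;
      Relevant = true;
    }
    if (Relevant)
      D.Rows.push_back(std::move(Row));
  }
  return D;
}
```

The two passes are needed for the reason given in the patch's comment: a standalone point for an opcode may appear before the chained point that makes it relevant.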

llvm/tools/llvm-exegesis/llvm-exegesis.cpp

(12 lines not shown)

#include "lib/Analysis.h"		#include "lib/Analysis.h"
#include "lib/BenchmarkResult.h"		#include "lib/BenchmarkResult.h"
#include "lib/BenchmarkRunner.h"		#include "lib/BenchmarkRunner.h"
#include "lib/Clustering.h"		#include "lib/Clustering.h"
#include "lib/Error.h"		#include "lib/Error.h"
#include "lib/LlvmState.h"		#include "lib/LlvmState.h"
#include "lib/PerfHelper.h"		#include "lib/PerfHelper.h"
		#include "lib/PostProcessing.h"
#include "lib/SnippetFile.h"		#include "lib/SnippetFile.h"
#include "lib/SnippetRepetitor.h"		#include "lib/SnippetRepetitor.h"
#include "lib/Target.h"		#include "lib/Target.h"
#include "lib/TargetSelect.h"		#include "lib/TargetSelect.h"
#include "llvm/ADT/StringExtras.h"		#include "llvm/ADT/StringExtras.h"
#include "llvm/ADT/Twine.h"		#include "llvm/ADT/Twine.h"
#include "llvm/MC/MCInstBuilder.h"		#include "llvm/MC/MCInstBuilder.h"
#include "llvm/MC/MCObjectFileInfo.h"		#include "llvm/MC/MCObjectFileInfo.h"
(375 lines not shown)	static void analysisMain() {
}		}

InitializeNativeTarget();		InitializeNativeTarget();
InitializeNativeTargetAsmPrinter();		InitializeNativeTargetAsmPrinter();
InitializeNativeTargetDisassembler();		InitializeNativeTargetDisassembler();

// Read benchmarks.		// Read benchmarks.
const LLVMState State("");		const LLVMState State("");
const std::vector<InstructionBenchmark> Points = ExitOnFileError(		std::vector<InstructionBenchmark> Points = ExitOnFileError(
BenchmarkFile, InstructionBenchmark::readYamls(State, BenchmarkFile));		BenchmarkFile, InstructionBenchmark::readYamls(State, BenchmarkFile));

outs() << "Parsed " << Points.size() << " benchmark points\n";		outs() << "Parsed " << Points.size() << " benchmark points\n";
if (Points.empty()) {		if (Points.empty()) {
errs() << "no benchmarks to analyze\n";		errs() << "no benchmarks to analyze\n";
return;		return;
}		}
// FIXME: Check that all points have the same triple/cpu.		// FIXME: Check that all points have the same triple/cpu.
// FIXME: Merge points from several runs (latency and uops).		// FIXME: Merge points from several runs (latency and uops).


std::string Error;		std::string Error;
const auto *TheTarget =		const auto *TheTarget =
TargetRegistry::lookupTarget(Points[0].LLVMTriple, Error);		TargetRegistry::lookupTarget(Points[0].LLVMTriple, Error);
if (!TheTarget) {		if (!TheTarget) {
errs() << "unknown target '" << Points[0].LLVMTriple << "'\n";		errs() << "unknown target '" << Points[0].LLVMTriple << "'\n";
return;		return;
}		}

std::unique_ptr<MCInstrInfo> InstrInfo(TheTarget->createMCInstrInfo());		std::unique_ptr<MCInstrInfo> InstrInfo(TheTarget->createMCInstrInfo());
assert(InstrInfo && "Unable to create instruction info!");		assert(InstrInfo && "Unable to create instruction info!");

		std::unique_ptr<MCSubtargetInfo> SubtargetInfo(
		TheTarget->createMCSubtargetInfo(Points[0].LLVMTriple, CpuName, ""));
		assert(SubtargetInfo && "Unable to create subtarget info!");

		PostProcessChainedLatencyBenchmarkPoints(Points, *InstrInfo, *SubtargetInfo);

const auto Clustering = ExitOnErr(InstructionBenchmarkClustering::create(		const auto Clustering = ExitOnErr(InstructionBenchmarkClustering::create(
Points, AnalysisClusteringAlgorithm, AnalysisDbscanNumPoints,		Points, AnalysisClusteringAlgorithm, AnalysisDbscanNumPoints,
AnalysisClusteringEpsilon, InstrInfo->getNumOpcodes()));		AnalysisClusteringEpsilon, InstrInfo->getNumOpcodes()));

const Analysis Analyzer(*TheTarget, std::move(InstrInfo), Clustering,		const Analysis Analyzer(
AnalysisInconsistencyEpsilon,		*TheTarget, std::move(InstrInfo), std::move(SubtargetInfo), Clustering,
AnalysisDisplayUnstableOpcodes, CpuName);		AnalysisInconsistencyEpsilon, AnalysisDisplayUnstableOpcodes, CpuName);

maybeRunAnalysis<Analysis::PrintClusters>(Analyzer, "analysis clusters",		maybeRunAnalysis<Analysis::PrintClusters>(Analyzer, "analysis clusters",
AnalysisClustersOutputFile);		AnalysisClustersOutputFile);
maybeRunAnalysis<Analysis::PrintSchedClassInconsistencies>(		maybeRunAnalysis<Analysis::PrintSchedClassInconsistencies>(
Analyzer, "sched class consistency analysis",		Analyzer, "sched class consistency analysis",
AnalysisInconsistenciesOutputFile);		AnalysisInconsistenciesOutputFile);
}		}

(20 lines not shown)

llvm/unittests/Support/LinearAlgebraTest.cpp

(169 lines not shown)	TEST(LinearAlgebraTest, OLS) {

M(1, 0) = 1;		M(1, 0) = 1;
M(1, 1) = 4;		M(1, 1) = 4;

Matrix<double> y(2, 1);		Matrix<double> y(2, 1);
y(0, 0) = 5;		y(0, 0) = 5;
y(1, 0) = 6;		y(1, 0) = 6;

auto beta = getOrdinaryLeastSquaresEstimation<double>(M, y);		auto beta = getOrdinaryLeastSquaresEstimator<double>(M, y);
EXPECT_EQ(std::vector<double>({4, 1. / 2}), getAllValues(beta));		EXPECT_EQ(std::vector<double>({4, 1. / 2}), getAllValues(beta));
}		}

TEST(LinearAlgebraTest, OLSOverdefined) {		TEST(LinearAlgebraTest, OLSOverdefined) {
Matrix<double> M(4, 2);		Matrix<double> M(4, 2);
M(0, 0) = 1;		M(0, 0) = 1;
M(0, 1) = 1;		M(0, 1) = 1;

M(1, 0) = 1;		M(1, 0) = 1;
M(1, 1) = 2;		M(1, 1) = 2;

M(2, 0) = 1;		M(2, 0) = 1;
M(2, 1) = 3;		M(2, 1) = 3;

M(3, 0) = 1;		M(3, 0) = 1;
M(3, 1) = 4;		M(3, 1) = 4;

Matrix<double> y(4, 1);		Matrix<double> y(4, 1);
y(0, 0) = 6;		y(0, 0) = 6;
y(1, 0) = 5;		y(1, 0) = 5;
y(2, 0) = 7;		y(2, 0) = 7;
y(3, 0) = 10;		y(3, 0) = 10;

auto beta = getOrdinaryLeastSquaresEstimation<double>(M, y);		auto beta = getOrdinaryLeastSquaresEstimator<double>(M, y);
EXPECT_EQ(std::vector<double>({7. / 2, 7. / 5}), getAllValues(beta));		EXPECT_EQ(std::vector<double>({7. / 2, 7. / 5}), getAllValues(beta));
}		}

} // namespace		} // namespace

Revision Contents

Diff 345728
