This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
test/tools/llvm-exegesis/X86/
-
tools/
-
llvm-exegesis/
-
X86/
2/3
analysis-latency-instruction-chaining.test
-
tools/llvm-exegesis/
-
llvm-exegesis/
-
lib/
-
Analysis.h
-
Analysis.cpp
-
CMakeLists.txt
1
PostProcessing.h
-
PostProcessing.cpp
-
llvm-exegesis.cpp

Differential D60000

[llvm-exegesis] Post-processing for chained instrs in latency mode (PR41275)
Needs ReviewPublic

Authored by lebedev.ri on Mar 29 2019, 10:52 AM.

Download Raw Diff

Details

Reviewers

courbet
gchatelet
andreadb

Summary

Ok, so this turned out to be easier than i expected.
Also, i initially thought that other modes might need this post-processing,
but i'm not sure which opcodes are affected there, if any.

The results look much better.
On BdVer2 this exposes at least one stable sched cluster
that has inconsistent values from from the measurements,
and a dozen or so somewhat-unstable clusters that also are inconsistent.

latency-clusters-stable.html165 KBDownload

latency-clusters-unstable.html1 MBDownload

Resolves(?) PR41275

Diff Detail

Repository: rL LLVM

Event Timeline

lebedev.ri created this revision.Mar 29 2019, 10:52 AM

Herald added subscribers: jdoerfert, tschuett, mgorny. · View Herald TranscriptMar 29 2019, 10:52 AM

Regarding remaining noise for these chained instrs, i'm guessing we
also need to account for some other latencies, e.g. domain crossing?

Refactor code a little

While there, also model domain transfer delays.

I'm not sure about the first instruction though,

vpextrb	$1, %xmm2, %edi
vpinsrb	$1, %edi, %xmm7, %xmm2
vpextrb	$1, %xmm2, %edi
vpinsrb	$1, %edi, %xmm7, %xmm2

don't we go fpu->int->fpu ?
Shouldn't we be also modelling the fpu2int delays?

Also, still seeing some weird noise in some of these chained instructions.

lebedev.ri added a reviewer: andreadb.Mar 30 2019, 7:19 AM

Also, only post-process 2-instruction benchmarks that were serialized
by llvm-exegesis itself, not just every n-instruction benchmarks.

I'll let gchatelet@ review that one as he started looking at doing that some time ago.

It doesn't feel like the right approach to me.
llvm-exegesis is about relying on measurement to deduce informations. Here you're using a priori knowledge (SchedClass) which may be wrong.
To me, a better approach would be to read all the experiments, create the dependency graph between the 2-instructions snippets and solve a system of equations to recover the per instruction latency, then use the analyzer on the result.

test/tools/llvm-exegesis/X86/analysis-latency-instruction-chaining.test
14	In this paragraph it is unclear what `next instruction` refers to. Maybe rephrase to something like "By constructions, snippet's instructions execution never overlaps. As a consequence the `per-snippet latency` is the sum of the latencies of the instructions in the Snippet. For instance, in the following example latency(BT32rr R11D R11D) + latency(RCR8rCL R11B R11B) = 12 "
46	Just to be sure, you crafted this InstructionBenchmark for the sake of testing right? I don't see how the current code generator can come up with three instructions.
tools/llvm-exegesis/lib/PostProcessing.h
11	Maybe explain what post processing does. As such it is not too informative.

In D60000#1451313, @gchatelet wrote:

It doesn't feel like the right approach to me.

In D60000#1451313, @gchatelet wrote:

llvm-exegesis is about relying on measurement to deduce informations. Here you're using a priori knowledge (SchedClass) which may be wrong.

Yes.

To me, a better approach would be to read all the experiments, create the dependency graph between the 2-instructions snippets and solve a system of equations to recover the per instruction latency, then use the analyzer on the result.

Can you explain that in a bit more detail? Something like

lat(i_0) = m_0
sum(lat(i_t)+lat(i_0)) = m_1
lat(i_1) = m_2
sum(lat(i_t)+lat(i_1)) = m_3
...
lat(i_n) => ?
sum(lat(i_t)+lat(i_n)) = m_n
lat(i_t) => ?

Do you suggest to take known lat(i_0)..lat(i_n) from measurements too?
How will that scheme will account for domain transfer delays?

test/tools/llvm-exegesis/X86/analysis-latency-instruction-chaining.test
46	Yes, this is just a copy-paste. As you can see, `assembled_snippet` is identical to the previous benchmark.

To me, a better approach would be to read all the experiments, create the dependency graph between the 2-instructions snippets and solve a system of equations to recover the per instruction latency, then use the analyzer on the result.

Can you explain that in a bit more detail? Something like
lat(i_0) = m_0
sum(lat(i_t)+lat(i_0)) = m_1
lat(i_1) = m_2
sum(lat(i_t)+lat(i_1)) = m_3
...
lat(i_n) => ?
sum(lat(i_t)+lat(i_n)) = m_n
lat(i_t) => ?
Do you suggest to take known lat(i_0)..lat(i_n) from measurements too?

Yes.
Measurement ought to be coherent for runs on the same CPU so with enough data the resulting linear system will be over constrained and is solvable using ordinary least square (https://en.wikipedia.org/wiki/Ordinary_least_squares).

How will that scheme will account for domain transfer delays?

You could associate a supplementary variable for pairs of instructions but this would need a lot of data to converge (way too many variables).
A simpler approach is to make sure that we don't generate domain transfer delays when generating the snippet, rejecting pairs of instructions for which it would occur.
Or annotate the results with information about domain transfer delays and deal with it in the post processing (adding variables for pairs in {int,vector,fp,store}² when we know such a transfer exist) this way we can recover the domain transfer delays as well.

In D60000#1452962, @gchatelet wrote:
To me, a better approach would be to read all the experiments, create the dependency graph between the 2-instructions snippets and solve a system of equations to recover the per instruction latency, then use the analyzer on the result.

Can you explain that in a bit more detail? Something like
lat(i_0) = m_0
sum(lat(i_t)+lat(i_0)) = m_1
lat(i_1) = m_2
sum(lat(i_t)+lat(i_1)) = m_3
...
lat(i_n) => ?
sum(lat(i_t)+lat(i_n)) = m_n
lat(i_t) => ?
Do you suggest to take known lat(i_0)..lat(i_n) from measurements too?
Yes.
Measurement ought to be coherent for runs on the same CPU so with enough data the resulting linear system will be over constrained and is solvable using ordinary least square (https://en.wikipedia.org/wiki/Ordinary_least_squares).

Okay, sounds sane.
Intermediate issue to solve: creating all these 2-instr chained configs must then also
create configs to measure the params of that second instr (without going into endless loop).

How will that scheme will account for domain transfer delays?

You could associate a supplementary variable for pairs of instructions but this would need a lot of data to converge (way too many variables).
A simpler approach is to make sure that we don't generate domain transfer delays when generating the snippet, rejecting pairs of instructions for which it would occur.
Or annotate the results with information about domain transfer delays and deal with it in the post processing (adding variables for pairs in {int,vector,fp,store}² when we know such a transfer exist) this way we can recover the domain transfer delays as well.

Hmm, i don't mean to mock, but it sounds kinda hand-wavy/arbitrary.
We don't want to use latencies of instructions specified in scheduler profile, but at the same
time we are ok with expecting that the sched profile explains all the domain transfer delays.
They should probably also be variables, not hardcoded. But i admit i have not thought that part through.

In the same thoughtflow, would be great if it could try to magically deduce the actual Units (*not* just pressure distribution)

Intermediate issue to solve: creating all these 2-instr chained configs must then also
create configs to measure the params of that second instr (without going into endless loop).

Yes it's more work and potentially really long runtimes. The randomization part is here because the original design was to run llvm-exegesis on many machines and aggregate the runs to do the analysis. The more data the better so we can test our hypotheses.

How will that scheme will account for domain transfer delays?

You could associate a supplementary variable for pairs of instructions but this would need a lot of data to converge (way too many variables).
A simpler approach is to make sure that we don't generate domain transfer delays when generating the snippet, rejecting pairs of instructions for which it would occur.
Or annotate the results with information about domain transfer delays and deal with it in the post processing (adding variables for pairs in {int,vector,fp,store}² when we know such a transfer exist) this way we can recover the domain transfer delays as well.

Hmm, i don't mean to mock

Then don't. The mere fact you're mentioning it is already suspicious :)

but it sounds kinda hand-wavy/arbitrary.

I offered three paths to explore, I don't know yet if they work, nor which one is best.
I'd need to dedicate some time to this but I don't have much right now TBH.

We don't want to use latencies of instructions specified in scheduler profile, but at the same
time we are ok with expecting that the sched profile explains all the domain transfer delays.

We want the tool to be as generic as possible to target other platforms but some decisions are to be target specific (e.g. the stack based registers of x87)
This is why llvm-exegesis relies on ExegesisTarget in llvm-exegesis/lib/Target.h.
As such it makes sense that the measurement tool is customized based on what we know or suspect about the target.
On the contrary it is a desirable property that the analysis tool to not depend on target knowledge as much.

They should probably also be variables, not hardcoded. But i admit i have not thought that part through.

This was one of the suggestions I made (DTD being variables), I agree this needs to be thought through.

In the same thoughtflow, would be great if it could try to magically deduce the actual Units (*not* just pressure distribution)

That would be nice indeed although nothing obvious comes to mind. We only have Hardware performance counter and code Snippets to recover knowledge.

In D60000#1454661, @gchatelet wrote:

Intermediate issue to solve: creating all these 2-instr chained configs must then also
create configs to measure the params of that second instr (without going into endless loop).

Yes it's more work and potentially really long runtimes. The randomization part is here because the original design was to run llvm-exegesis on many machines and aggregate the runs to do the analysis. The more data the better so we can test our hypotheses.

How will that scheme will account for domain transfer delays?

You could associate a supplementary variable for pairs of instructions but this would need a lot of data to converge (way too many variables).
A simpler approach is to make sure that we don't generate domain transfer delays when generating the snippet, rejecting pairs of instructions for which it would occur.
Or annotate the results with information about domain transfer delays and deal with it in the post processing (adding variables for pairs in {int,vector,fp,store}² when we know such a transfer exist) this way we can recover the domain transfer delays as well.

Hmm, i don't mean to mock

Then don't. The mere fact you're mentioning it is already suspicious :)

but it sounds kinda hand-wavy/arbitrary.

I offered three paths to explore, I don't know yet if they work, nor which one is best.
I'd need to dedicate some time to this but I don't have much right now TBH.

We don't want to use latencies of instructions specified in scheduler profile, but at the same
time we are ok with expecting that the sched profile explains all the domain transfer delays.

We want the tool to be as generic as possible to target other platforms but some decisions are to be target specific (e.g. the stack based registers of x87)
This is why llvm-exegesis relies on ExegesisTarget in llvm-exegesis/lib/Target.h.
As such it makes sense that the measurement tool is customized based on what we know or suspect about the target.
On the contrary it is a desirable property that the analysis tool to not depend on target knowledge as much.

What i'm saying is that it is pretty much known as a fact that LLVM (at least on X86?) has (very?) partial modelling
of these extra delays, so if we take as granted that all of them are already modelled, the post-processing
may produce rather misleading results.

They should probably also be variables, not hardcoded. But i admit i have not thought that part through.

This was one of the suggestions I made (DTD being variables), I agree this needs to be thought through.

ack

In the same thoughtflow, would be great if it could try to magically deduce the actual Units (*not* just pressure distribution)

That would be nice indeed although nothing obvious comes to mind. We only have Hardware performance counter and code Snippets to recover knowledge.

ack

lebedev.ri mentioned this in D60401: [llvm-exegesis] When generating templates with chained instructions, also add templates for helper instructions.Apr 8 2019, 2:18 AM

In D60000#1454661, @gchatelet wrote:

Intermediate issue to solve: creating all these 2-instr chained configs must then also
create configs to measure the params of that second instr (without going into endless loop).

Yes it's more work and potentially really long runtimes. The randomization part is here because the original design was to run llvm-exegesis on many machines and aggregate the runs to do the analysis. The more data the better so we can test our hypotheses.

How will that scheme will account for domain transfer delays?

You could associate a supplementary variable for pairs of instructions but this would need a lot of data to converge (way too many variables).
A simpler approach is to make sure that we don't generate domain transfer delays when generating the snippet, rejecting pairs of instructions for which it would occur.
Or annotate the results with information about domain transfer delays and deal with it in the post processing (adding variables for pairs in {int,vector,fp,store}² when we know such a transfer exist) this way we can recover the domain transfer delays as well.

Hmm, i don't mean to mock

Then don't. The mere fact you're mentioning it is already suspicious :)

but it sounds kinda hand-wavy/arbitrary.

I offered three paths to explore, I don't know yet if they work, nor which one is best.
I'd need to dedicate some time to this but I don't have much right now TBH.

Any kind of approximate time estimate on this?
Next month? This release cycle? Next year? "some day, once there is time"?

@courbet @gchatelet bump. any plans on working on that functionality anytime soon? :)
Not sure how healthy for any project such silence in form of neither working on
nor allowing other interested parties to work on missing functionality...

In D60000#1785210, @lebedev.ri wrote:

@courbet @gchatelet bump. any plans on working on that functionality anytime soon? :)
Not sure how healthy for any project such silence in form of neither working on
nor allowing other interested parties to work on missing functionality...

Yes I'll be working on it from now on. Stay tuned.

mstojanovic added a subscriber: mstojanovic.Dec 18 2019, 10:04 AM

In D60000#1785687, @gchatelet wrote:

In D60000#1785210, @lebedev.ri wrote:

@courbet @gchatelet bump. any plans on working on that functionality anytime soon? :)
Not sure how healthy for any project such silence in form of neither working on
nor allowing other interested parties to work on missing functionality...

Yes I'll be working on it from now on. Stay tuned.

I've been side tracked but this is still on my radar.

lebedev.ri mentioned this in D74156: [llvm-exegesis] Exploring X86::OperandType::OPERAND_COND_CODE.Feb 7 2020, 9:50 AM

lebedev.ri mentioned this in D75510: [X86][llvm-exegesis] Exploring vector insert/extract.Mar 3 2020, 2:26 AM

Now that i'm renewly motivated to finally get this fixed, let's try again :)

This implements linear algebra support library (i don't think we want to pull in eigen do we?),
implements ordinary/weighted least squares, and uses it to post-process latency benchmark points.

This kinda just works:

latency-clusters-ols.html198 KBDownload

latency-clusters.html151 KBDownload

But i've made two observations:

if we measured a single instruction, we can say that the measurement is precise, so it's estimator should deviate from the actual measurement by as little as possible => weighted least squares
i was really hoping i wouldn't have to deal with domain transfer delays, but the results are kinda bogus right now :) That, or there is some other issue.

Herald added a subscriber: dexonsmith. · View Herald TranscriptMay 4 2021, 12:28 PM

Actually drop pre-monorepo diff parts.

Some early comments on the linear algebra library.
Let me sync with @courbet first.

llvm/include/llvm/Support/LinearAlgebra.h
30 ↗	(On Diff #342825)	`static_assert` to make sure that `std::is_base_of<T, Derived>::value==true`
33 ↗	(On Diff #342825)	fix
37 ↗	(On Diff #342825)	you may want to provide a function to factor the cast operation in.
37–38 ↗	(On Diff #342825)	These could be properties `rows()` `columns()`.
60–64 ↗	(On Diff #342825)	fix
66–67 ↗	(On Diff #342825)	ditto - properties
82 ↗	(On Diff #342825)	`MatrixTy`
83 ↗	(On Diff #342825)	AFAIU from the implementation, this really is a `TransposedMatrixView`. I think it's important to highlight because `m` should outlive the `TransposedMatrix` object.
85 ↗	(On Diff #342825)	`matrix`
85 ↗	(On Diff #342825)	Using a reference instead of a pointer prevents rebinding of this object (can't move, can't reassign). Is it a design decision?
97 ↗	(On Diff #342825)	This is still a `View` AFAICT
140–141 ↗	(On Diff #342825)	Wrong comment
149 ↗	(On Diff #342825)	`const`
197 ↗	(On Diff #342825)	This is hard to read, do you mind introducing a variable for the expected value?
209 ↗	(On Diff #342825)	`const`
220 ↗	(On Diff #342825)	`const` here and below
352 ↗	(On Diff #342825)	`const` here and below

In D60000#2738598, @gchatelet wrote:

Some early comments on the linear algebra library.
Let me sync with @courbet first.

I'm still in early experimentation stages with this, this is not the code i'd post normally, i only posted just to avoid concurrent implementations.
I'm still getting really weird latencies, and just subtracting forwarding delays from sched model doesn't help.
This may mean that either the math is wrong, or there are many more forwarding delays that we don't model.

lebedev.ri mentioned this in D94395: [X86] AMD Zen 3 Scheduler Model.May 14 2021, 10:26 AM

Looks like we end up recovering a forwarding delay of -0.04.
It should be ~-2.

Herald added subscribers: pengfei, hiraditya. · View Herald TranscriptMay 16 2021, 2:31 PM

Harbormaster completed remote builds in B104724: Diff 345728.May 16 2021, 2:31 PM

Hm, one thing i should/could try here, is to ignore benchmarks that may have forwarding delays.
It won't help with unmodelled forwarding delays (we don't model IVec->FVec delays do we?),
but maybe it will be good-enough for some sequences..

Matt added a subscriber: Matt.Jul 27 2021, 5:23 AM

Revision Contents

Path

Size

test/

tools/

llvm-exegesis/

X86/

analysis-latency-instruction-chaining.test

59 lines

tools/

llvm-exegesis/

lib/

1 line

6 lines

1 line

33 lines

87 lines

14 lines

Diff 192914

test/tools/llvm-exegesis/X86/analysis-latency-instruction-chaining.test

This file was added.

				# RUN: llvm-exegesis -mode=analysis -benchmarks-file=%s -analysis-clusters-output-file=- -analysis-clustering-epsilon=0.5 -analysis-inconsistency-epsilon=0.5 -analysis-numpoints=1 \| FileCheck -check-prefixes=CHECK-ALL %s

				# CHECK-ALL: {{^}}cluster_id,opcode_name,config,sched_class,latency{{$}}

				# CHECK-NEXT: {{^}}0,
				# CHECK-SAME: ,1.00{{$}}
				# CHECK-NEXT: {{^}}0,
				# CHECK-SAME: ,1.00{{$}}

				# Instructions were executed serially, meaning that the next instruction
				# ONLY starts executing when the current instruction finishes.
				# Thus, the real latency of the first instruction is the per_snippet_value minus
				# the sum of latencies of all the other instructions in the snippet.

				gchateletUnsubmitted Not Done Reply Inline Actions In this paragraph it is unclear what `next instruction` refers to. Maybe rephrase to something like "By constructions, snippet's instructions execution never overlaps. As a consequence the `per-snippet latency` is the sum of the latencies of the instructions in the Snippet. For instance, in the following example latency(BT32rr R11D R11D) + latency(RCR8rCL R11B R11B) = 12 " gchatelet: In this paragraph it is unclear what `next instruction` refers to. Maybe rephrase to something…
				# RCR8rCL has latency of 11. (the value from scheduling profile!)
				# Latency of whole snipped is 12 or 23. (not measured, hand-written.)
				# Thus, latency of BT32rr is 12-11 = 1, or 23-11-11 = 1

				---
				mode: latency
				key:
				instructions:
				- 'BT32rr R11D R11D'
				- 'RCR8rCL R11B R11B'
				config: ''
				register_initial_values:
				- 'R11D=0x0'
				- 'R11B=0x0'
				- 'CL=0x0'
				cpu_name: bdver2
				llvm_triple: x86_64-unknown-linux-gnu
				num_repetitions: 10000
				measurements:
				- { key: latency, value: 0.0000, per_snippet_value: 12.0000 }
				error: ''
				info: Repeating two instructions
				assembled_snippet: 41BB0000000041B300B100450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DBC3
				...
				---
				mode: latency
				key:
				instructions:
				- 'BT32rr R11D R11D'
				- 'RCR8rCL R11B R11B'
				- 'RCR8rCL R11B R11B'
				config: ''
				gchateletUnsubmitted Done Reply Inline Actions Just to be sure, you crafted this InstructionBenchmark for the sake of testing right? I don't see how the current code generator can come up with three instructions. gchatelet: Just to be sure, you crafted this InstructionBenchmark for the sake of testing right? I don't…
				lebedev.riAuthorUnsubmitted Done Reply Inline Actions Yes, this is just a copy-paste. As you can see, `assembled_snippet` is identical to the previous benchmark. lebedev.ri: Yes, this is just a copy-paste. As you can see, `assembled_snippet` is identical to the…
				register_initial_values:
				- 'R11D=0x0'
				- 'R11B=0x0'
				- 'CL=0x0'
				cpu_name: bdver2
				llvm_triple: x86_64-unknown-linux-gnu
				num_repetitions: 10000
				measurements:
				- { key: latency, value: 99.0000, per_snippet_value: 23.0000 }
				error: ''
				info: Repeating two instructions
				assembled_snippet: 41BB0000000041B300B100450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DB450FA3DB41D2DBC3
				...

tools/llvm-exegesis/lib/Analysis.h

	Show All 31 Lines

	namespace llvm {			namespace llvm {
	namespace exegesis {			namespace exegesis {

	// A helper class to analyze benchmark results for a target.			// A helper class to analyze benchmark results for a target.
	class Analysis {			class Analysis {
	public:			public:
	Analysis(const llvm::Target &Target,			Analysis(const llvm::Target &Target,
				std::unique_ptr<llvm::MCSubtargetInfo> SubtargetInfo,
	std::unique_ptr<llvm::MCInstrInfo> InstrInfo,			std::unique_ptr<llvm::MCInstrInfo> InstrInfo,
	const InstructionBenchmarkClustering &Clustering,			const InstructionBenchmarkClustering &Clustering,
	double AnalysisInconsistencyEpsilon,			double AnalysisInconsistencyEpsilon,
	bool AnalysisDisplayUnstableOpcodes);			bool AnalysisDisplayUnstableOpcodes);

	// Prints a csv of instructions for each cluster.			// Prints a csv of instructions for each cluster.
	struct PrintClusters {};			struct PrintClusters {};
	// Find potential errors in the scheduling information given measurements.			// Find potential errors in the scheduling information given measurements.
	▲ Show 20 Lines • Show All 78 Lines • Show Last 20 Lines

tools/llvm-exegesis/lib/Analysis.cpp

Show First 20 Lines • Show All 152 Lines • ▼ Show 20 Lines	#endif
for (const auto &Measurement : Point.Measurements) {		for (const auto &Measurement : Point.Measurements) {
OS << kCsvSep;		OS << kCsvSep;
writeMeasurementValue<kEscapeCsv>(OS, Measurement.PerInstructionValue);		writeMeasurementValue<kEscapeCsv>(OS, Measurement.PerInstructionValue);
}		}
OS << "\n";		OS << "\n";
}		}

Analysis::Analysis(const llvm::Target &Target,		Analysis::Analysis(const llvm::Target &Target,
		std::unique_ptr<MCSubtargetInfo> SubtargetInfo,
std::unique_ptr<llvm::MCInstrInfo> InstrInfo,		std::unique_ptr<llvm::MCInstrInfo> InstrInfo,
const InstructionBenchmarkClustering &Clustering,		const InstructionBenchmarkClustering &Clustering,
double AnalysisInconsistencyEpsilon,		double AnalysisInconsistencyEpsilon,
bool AnalysisDisplayUnstableOpcodes)		bool AnalysisDisplayUnstableOpcodes)
: Clustering_(Clustering), InstrInfo_(std::move(InstrInfo)),		: Clustering_(Clustering), SubtargetInfo_(std::move(SubtargetInfo)),
		InstrInfo_(std::move(InstrInfo)),
AnalysisInconsistencyEpsilonSquared_(AnalysisInconsistencyEpsilon *		AnalysisInconsistencyEpsilonSquared_(AnalysisInconsistencyEpsilon *
AnalysisInconsistencyEpsilon),		AnalysisInconsistencyEpsilon),
AnalysisDisplayUnstableOpcodes_(AnalysisDisplayUnstableOpcodes) {		AnalysisDisplayUnstableOpcodes_(AnalysisDisplayUnstableOpcodes) {
if (Clustering.getPoints().empty())		if (Clustering.getPoints().empty())
return;		return;

const InstructionBenchmark &FirstPoint = Clustering.getPoints().front();		const InstructionBenchmark &FirstPoint = Clustering.getPoints().front();
RegInfo_.reset(Target.createMCRegInfo(FirstPoint.LLVMTriple));		RegInfo_.reset(Target.createMCRegInfo(FirstPoint.LLVMTriple));
AsmInfo_.reset(Target.createMCAsmInfo(*RegInfo_, FirstPoint.LLVMTriple));		AsmInfo_.reset(Target.createMCAsmInfo(*RegInfo_, FirstPoint.LLVMTriple));
SubtargetInfo_.reset(Target.createMCSubtargetInfo(FirstPoint.LLVMTriple,
FirstPoint.CpuName, ""));
InstPrinter_.reset(Target.createMCInstPrinter(		InstPrinter_.reset(Target.createMCInstPrinter(
llvm::Triple(FirstPoint.LLVMTriple), 0 /default variant/, *AsmInfo_,		llvm::Triple(FirstPoint.LLVMTriple), 0 /default variant/, *AsmInfo_,
InstrInfo_, RegInfo_));		InstrInfo_, RegInfo_));

Context_ = llvm::make_unique<llvm::MCContext>(AsmInfo_.get(), RegInfo_.get(),		Context_ = llvm::make_unique<llvm::MCContext>(AsmInfo_.get(), RegInfo_.get(),
&ObjectFileInfo_);		&ObjectFileInfo_);
Disasm_.reset(Target.createMCDisassembler(SubtargetInfo_, Context_));		Disasm_.reset(Target.createMCDisassembler(SubtargetInfo_, Context_));
assert(Disasm_ && "cannot create MCDisassembler. missing call to "		assert(Disasm_ && "cannot create MCDisassembler. missing call to "
▲ Show 20 Lines • Show All 383 Lines • Show Last 20 Lines

tools/llvm-exegesis/lib/CMakeLists.txt

Show All 21 Lines	add_library(LLVMExegesis
BenchmarkResult.cpp		BenchmarkResult.cpp
BenchmarkRunner.cpp		BenchmarkRunner.cpp
Clustering.cpp		Clustering.cpp
CodeTemplate.cpp		CodeTemplate.cpp
Latency.cpp		Latency.cpp
LlvmState.cpp		LlvmState.cpp
MCInstrDescView.cpp		MCInstrDescView.cpp
PerfHelper.cpp		PerfHelper.cpp
		PostProcessing.cpp
RegisterAliasing.cpp		RegisterAliasing.cpp
RegisterValue.cpp		RegisterValue.cpp
SchedClassResolution.cpp		SchedClassResolution.cpp
SnippetGenerator.cpp		SnippetGenerator.cpp
Target.cpp		Target.cpp
Uops.cpp		Uops.cpp
)		)

Show All 21 Lines

tools/llvm-exegesis/lib/PostProcessing.h

This file was added.

				//===-- PostProcessing.h ----------------------------------------- C++ --===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				///
				/// \file
				/// Post-processing for the benchmark points.
				///
				gchateletUnsubmitted Not Done Reply Inline Actions Maybe explain what post processing does. As such it is not too informative. gchatelet: Maybe explain what post processing does. As such it is not too informative.
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_TOOLS_LLVM_EXEGESIS_POSTPROCESSING_H
				#define LLVM_TOOLS_LLVM_EXEGESIS_POSTPROCESSING_H

				#include "BenchmarkResult.h"
				#include "llvm/ADT/ArrayRef.h"
				#include "llvm/MC/MCInstrInfo.h"
				#include "llvm/MC/MCSubtargetInfo.h"

				namespace llvm {
				namespace exegesis {

				void PostProcessBenchmarkPoints(
				const llvm::MCSubtargetInfo &SubtargetInfo,
				const llvm::MCInstrInfo &InstrInfo,
				llvm::MutableArrayRef<InstructionBenchmark> Points);

				} // namespace exegesis
				} // namespace llvm

				#endif // LLVM_TOOLS_LLVM_EXEGESIS_POSTPROCESSING_H

tools/llvm-exegesis/lib/PostProcessing.cpp

This file was added.

				//===-- PostProcessing.cpp --------------------------------------- C++ --===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "PostProcessing.h"
				#include "Clustering.h"
				#include "SchedClassResolution.h"
				#include "llvm/ADT/STLExtras.h"
				#include <utility>

				namespace llvm {
				namespace exegesis {

				static std::vector<BenchmarkMeasure>
				GetSchedDataAsPoint(const llvm::MCSubtargetInfo &SubtargetInfo,
				const llvm::MCInstrInfo &InstrInfo,
				const llvm::MCInst &Instr,
				const InstructionBenchmark &Point,
				const SchedClassClusterCentroid &Centroid) {
				// 1. Resolve sched class id of the instruction
				std::pair<unsigned /SchedClassId/, bool /WasVariant/> ID =
				ResolvedSchedClass::resolveSchedClassId(SubtargetInfo, InstrInfo, Instr);

				// 2. Produce ResolvedSchedClass for the resolved sched class id.
				ResolvedSchedClass RSC(SubtargetInfo, ID.first, ID.second);

				// 3. Convert ResolvedSchedClass into a 'benchmark point'.
				// We need Centroid only for the Keys though.
				return RSC.getAsPoint(Point.Mode, SubtargetInfo, Centroid.getStats());
				}

				static void PostProcessPoint(const llvm::MCSubtargetInfo &SubtargetInfo,
				const llvm::MCInstrInfo &InstrInfo,
				InstructionBenchmark &Point) {
				assert(Point.Key.Instructions.size() > 1 && "Should have more than 1 instr.");

				// 1. Produce a centroid out of the measured values.
				// We only need it for the Keys and validatation though.
				SchedClassClusterCentroid Centroid;
				Centroid.addPoint(Point.Measurements);
				if (!Centroid.validate(Point.Mode)) // Ignore error points.
				return;

				// 2. Replace invalid per-instr value with valid per-snippet value.
				// The benchmarking code blindly divided per-snippet value by the instr count.
				llvm::for_each(Point.Measurements, [](BenchmarkMeasure &Measure) {
				Measure.PerInstructionValue = Measure.PerSnippetValue;
				});

				// 3. And finally, subtract the SchedClass-specified values of the extra
				// instructions from the measured values, thus leaving only the value
				// that actually belongs to the first instruction.
				for (const llvm::MCInst &Instr :
				ArrayRef<llvm::MCInst>(Point.Key.Instructions).drop_front()) {
				std::vector<BenchmarkMeasure> Measures =
				GetSchedDataAsPoint(SubtargetInfo, InstrInfo, Instr, Point, Centroid);
				if (Measures.empty()) // Ignore malformed benchmarks. This won't cause
				return; // corruptions because if this fails it will fail the first time.
				assert(Point.Measurements.size() == Measures.size() &&
				"Expected dimensions for measured and computed values to match.");
				for (const auto &I : llvm::zip(Point.Measurements, Measures))
				std::get<0>(I).PerInstructionValue -= std::get<1>(I).PerInstructionValue;
				}
				}

				static bool ShouldPostProcess(InstructionBenchmark &Point) {
				// If the benchmark contains more than one instruction, then we will want to
				// post-process the measurements to remove the noise from those extra instrs.
				return Point.Mode == InstructionBenchmark::ModeE::Latency &&
				Point.Key.Instructions.size() > 1 && !Point.Measurements.empty();
				}

				void PostProcessBenchmarkPoints(
				const llvm::MCSubtargetInfo &SubtargetInfo,
				const llvm::MCInstrInfo &InstrInfo,
				llvm::MutableArrayRef<InstructionBenchmark> Points) {
				for (InstructionBenchmark &Point :
				llvm::make_filter_range(Points, ShouldPostProcess))
				PostProcessPoint(SubtargetInfo, InstrInfo, Point);
				}

				} // namespace exegesis
				} // namespace llvm

tools/llvm-exegesis/llvm-exegesis.cpp

Show All 11 Lines
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "lib/Analysis.h"		#include "lib/Analysis.h"
#include "lib/BenchmarkResult.h"		#include "lib/BenchmarkResult.h"
#include "lib/BenchmarkRunner.h"		#include "lib/BenchmarkRunner.h"
#include "lib/Clustering.h"		#include "lib/Clustering.h"
#include "lib/LlvmState.h"		#include "lib/LlvmState.h"
#include "lib/PerfHelper.h"		#include "lib/PerfHelper.h"
		#include "lib/PostProcessing.h"
#include "lib/Target.h"		#include "lib/Target.h"
#include "llvm/ADT/StringExtras.h"		#include "llvm/ADT/StringExtras.h"
#include "llvm/ADT/Twine.h"		#include "llvm/ADT/Twine.h"
#include "llvm/MC/MCInstBuilder.h"		#include "llvm/MC/MCInstBuilder.h"
#include "llvm/MC/MCObjectFileInfo.h"		#include "llvm/MC/MCObjectFileInfo.h"
#include "llvm/MC/MCParser/MCAsmParser.h"		#include "llvm/MC/MCParser/MCAsmParser.h"
#include "llvm/MC/MCParser/MCTargetAsmParser.h"		#include "llvm/MC/MCParser/MCTargetAsmParser.h"
#include "llvm/MC/MCRegisterInfo.h"		#include "llvm/MC/MCRegisterInfo.h"
▲ Show 20 Lines • Show All 416 Lines • ▼ Show 20 Lines	llvm::report_fatal_error(
"--analysis-inconsistencies-output-file must be specified.");		"--analysis-inconsistencies-output-file must be specified.");
}		}

llvm::InitializeNativeTarget();		llvm::InitializeNativeTarget();
llvm::InitializeNativeTargetAsmPrinter();		llvm::InitializeNativeTargetAsmPrinter();
llvm::InitializeNativeTargetDisassembler();		llvm::InitializeNativeTargetDisassembler();
// Read benchmarks.		// Read benchmarks.
const LLVMState State("");		const LLVMState State("");
const std::vector<InstructionBenchmark> Points =		std::vector<InstructionBenchmark> Points =
ExitOnErr(InstructionBenchmark::readYamls(State, BenchmarkFile));		ExitOnErr(InstructionBenchmark::readYamls(State, BenchmarkFile));
llvm::outs() << "Parsed " << Points.size() << " benchmark points\n";		llvm::outs() << "Parsed " << Points.size() << " benchmark points\n";
if (Points.empty()) {		if (Points.empty()) {
llvm::errs() << "no benchmarks to analyze\n";		llvm::errs() << "no benchmarks to analyze\n";
return;		return;
}		}
// FIXME: Check that all points have the same triple/cpu.		// FIXME: Check that all points have the same triple/cpu.
// FIXME: Merge points from several runs (latency and uops).		// FIXME: Merge points from several runs (latency and uops).

std::string Error;		std::string Error;
const auto *TheTarget =		const auto *TheTarget =
llvm::TargetRegistry::lookupTarget(Points[0].LLVMTriple, Error);		llvm::TargetRegistry::lookupTarget(Points[0].LLVMTriple, Error);
if (!TheTarget) {		if (!TheTarget) {
llvm::errs() << "unknown target '" << Points[0].LLVMTriple << "'\n";		llvm::errs() << "unknown target '" << Points[0].LLVMTriple << "'\n";
return;		return;
}		}

		std::unique_ptr<llvm::MCSubtargetInfo> SubtargetInfo(
		TheTarget->createMCSubtargetInfo(Points[0].LLVMTriple, Points[0].CpuName,
		""));
std::unique_ptr<llvm::MCInstrInfo> InstrInfo(TheTarget->createMCInstrInfo());		std::unique_ptr<llvm::MCInstrInfo> InstrInfo(TheTarget->createMCInstrInfo());

		PostProcessBenchmarkPoints(SubtargetInfo, InstrInfo, Points);

const auto Clustering = ExitOnErr(InstructionBenchmarkClustering::create(		const auto Clustering = ExitOnErr(InstructionBenchmarkClustering::create(
Points, AnalysisClusteringAlgorithm, AnalysisDbscanNumPoints,		Points, AnalysisClusteringAlgorithm, AnalysisDbscanNumPoints,
AnalysisClusteringEpsilon, InstrInfo->getNumOpcodes()));		AnalysisClusteringEpsilon, InstrInfo->getNumOpcodes()));

const Analysis Analyzer(*TheTarget, std::move(InstrInfo), Clustering,		const Analysis Analyzer(
AnalysisInconsistencyEpsilon,		*TheTarget, std::move(SubtargetInfo), std::move(InstrInfo), Clustering,
AnalysisDisplayUnstableOpcodes);		AnalysisInconsistencyEpsilon, AnalysisDisplayUnstableOpcodes);

maybeRunAnalysis<Analysis::PrintClusters>(Analyzer, "analysis clusters",		maybeRunAnalysis<Analysis::PrintClusters>(Analyzer, "analysis clusters",
AnalysisClustersOutputFile);		AnalysisClustersOutputFile);
maybeRunAnalysis<Analysis::PrintSchedClassInconsistencies>(		maybeRunAnalysis<Analysis::PrintSchedClassInconsistencies>(
Analyzer, "sched class consistency analysis",		Analyzer, "sched class consistency analysis",
AnalysisInconsistenciesOutputFile);		AnalysisInconsistenciesOutputFile);
}		}

Show All 20 Lines