This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/tools/llvm-mca/
-
tools/
-
llvm-mca/
-
include/
-
Context.h
-
Stages/
-
FetchStage.h
-
lib/
-
Context.cpp
-
Stages/
-
FetchStage.cpp
-
llvm-mca.cpp

Differential D53055

[MCA] Limit the number of bytes fetched per cycle.
Needs ReviewPublic

Authored by orodley on Oct 9 2018, 5:06 PM.

Download Raw Diff

Details

Reviewers

andreadb
courbet
gchatelet
atrick
RKSimon

Summary

For an example of this, see the Intel Optimization Reference Manual,
2.5.2.2 ("Instruction Fetch Unit"):

An instruction fetch is a 16-byte aligned lookup through the ITLB into
the instruction cache and instruction prefetch buffers. A hit in the
instruction cache causes 16 bytes to be delivered to the instruction
predecoder.

The model here (fetching exactly 16 bytes every cycle) is incomplete, as
it assumes every instruction is aligned and in icache, but it's an
improvement.

This is usually not a bottleneck, as stalls elsewhere will hide it, but
it can make a difference in some cases. Further work would be to
classify these cases in the report output - if there are unutilised
issue slots while there is no backend stall, we are frontend bound.

Diff Detail

Repository

rL LLVM

Build Status

Buildable 23625
Build 23624: arc lint + arc unit

Event Timeline

orodley created this revision.Oct 9 2018, 5:06 PM

Herald added a reviewer: andreadb. · View Herald TranscriptOct 9 2018, 5:06 PM

Herald added subscribers: llvm-commits, gbedwell. · View Herald Transcript

Harbormaster completed remote builds in B23625: Diff 168914.Oct 9 2018, 5:08 PM

orodley retitled this revision from [MCA] Constrain the number of bytes fetched per cycle. to [MCA] Limit the number of bytes fetched per cycle..Oct 9 2018, 7:15 PM

This should eventually be configurable in the sched target, right?
Also, a single fetch 'queue' also may not fully fit everyone.
https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf

While previous AMD64 family 15h processors had a single 32-byte fetch window, AMD Family 15h,
models 30h–4Fh processors have two 32-byte fetch windows, from which three micro-ops can be
selected.

Hi Owen,

The default pipeline in llvm-mca doesn't simulate any hardware frontend logic.

The Fetch Stage in llvm-mca is only responsible for creating instructions and moving them to the next pipeline stage.
It doesn't have to be confused with the Fetch logic in the hardware frontend, which - as you wrote - is responsible for fetching portions of a cache line every cycle, and feed them to the decoders via an instruction byte queue.

The llvm-mca Fetch stage is equivalent to an unbounded queue of already decoded instructions. Instructions from every iteration are immediately available at cycle 0.

About the topic of simulating the hardware frontend logic:

There is already bug https://bugs.llvm.org/show_bug.cgi?id=36665, which is about adding support for simulating the hardware frontend logic.
I know that @courbet and his team would like to work on it. So, you can probably try to work with them on this.
Unfortunately, that bugzilla must be updated. There is not enough information there (I suggested to send a detailed RFC upstream in case).

I strongly suggest you/your team/Clement's team to work together on that task. I am afraid that people may be working on the same tasks in parallel.. That has to be avoided.
You can use that bugzilla to coordinate your work upsteam on this.

Now that llvm-mca is a library, people can define their own custom pipeline without having to modify the "default pipeline stages".
In particular, I don't want to introduce any frontend concepts in the default pipeline of llvm-mca.
For now, any frontend simulation should be implemented by stages that are not part of the default pipeline. The default pipeline should only stay focused on simulating the hardware backend logic.

In future, Subtargets will be able to declare custom pipelines via tablegen.
The default pipeline would only be be used by subtargets that don't provide/specify their own custom pipeline.
That being said, this is a long term goal, and it has to be properly discussed in an RFC. It will also affect the future of PR36665.

-Andrea

Hi Andrea,

There is already bug https://bugs.llvm.org/show_bug.cgi?id=36665, which is about adding support for simulating the hardware frontend logic.
I know that @courbet and his team would like to work on it. So, you can probably try to work with them on this.
Unfortunately, that bugzilla must be updated. There is not enough information there (I suggested to send a detailed RFC upstream in case).

I strongly suggest you/your team/Clement's team to work together on that task. I am afraid that people may be working on the same tasks in parallel.. That has to be avoided.
You can use that bugzilla to coordinate your work upsteam on this.

Let me clarify this: Owen is working with us :) He has taken over the genetic scheduler work I presented at EuroLLVM. One of the bottlenecks we had was the frontend hence the change. I agree that this should have been made clearer (@owenrodley, can you create a bugzilla account and assign the bug to yourself ?)

In D53055#1260195, @andreadb wrote:

The default pipeline in llvm-mca doesn't simulate any hardware frontend logic.

The Fetch Stage in llvm-mca is only responsible for creating instructions and moving them to the next pipeline stage.
It doesn't have to be confused with the Fetch logic in the hardware frontend, which - as you wrote - is responsible for fetching portions of a cache line every cycle, and feed them to the decoders via an instruction byte queue.

The llvm-mca Fetch stage is equivalent to an unbounded queue of already decoded instructions. Instructions from every iteration are immediately available at cycle 0.

All of this sounds more like a naming issue than an issue about what Owen is trying to implement.
Maybe we could rename FetchState into DecodedStage or something like this ?

Now that llvm-mca is a library, people can define their own custom pipeline without having to modify the "default pipeline stages".
In particular, I don't want to introduce any frontend concepts in the default pipeline of llvm-mca.
For now, any frontend simulation should be implemented by stages that are not part of the default pipeline. The default pipeline should only stay focused on simulating the hardware backend logic.

I guess you meant to say: "the default pipeline should only stay focused on simulating what is modeled in the MCSchedModel" ? If we can carve out something that is common to all frontends, then it could end up in MCSchedModel, and then be in the default llvm-mca pipeline. (BTW the resource pointed to by Roman shows that the approach here might not be generic enough).
Until then it feels like the flag is a low-cost approach to implementing this.

In D53055#1261982, @courbet wrote:

Hi Andrea,

There is already bug https://bugs.llvm.org/show_bug.cgi?id=36665, which is about adding support for simulating the hardware frontend logic.
I know that @courbet and his team would like to work on it. So, you can probably try to work with them on this.
Unfortunately, that bugzilla must be updated. There is not enough information there (I suggested to send a detailed RFC upstream in case).

I strongly suggest you/your team/Clement's team to work together on that task. I am afraid that people may be working on the same tasks in parallel.. That has to be avoided.
You can use that bugzilla to coordinate your work upsteam on this.

Let me clarify this: Owen is working with us :) He has taken over the genetic scheduler work I presented at EuroLLVM. One of the bottlenecks we had was the frontend hence the change. I agree that this should have been made clearer (@owenrodley, can you create a bugzilla account and assign the bug to yourself ?)

Okay. Good to know that there is no overlap :-).

In D53055#1260195, @andreadb wrote:

The default pipeline in llvm-mca doesn't simulate any hardware frontend logic.

The Fetch Stage in llvm-mca is only responsible for creating instructions and moving them to the next pipeline stage.
It doesn't have to be confused with the Fetch logic in the hardware frontend, which - as you wrote - is responsible for fetching portions of a cache line every cycle, and feed them to the decoders via an instruction byte queue.

The llvm-mca Fetch stage is equivalent to an unbounded queue of already decoded instructions. Instructions from every iteration are immediately available at cycle 0.

All of this sounds more like a naming issue than an issue about what Owen is trying to implement.
Maybe we could rename FetchState into DecodedStage or something like this ?

The "FetchStage" is literally just there to create instructions and move them to the next stage.

It is just an artificial entrypoint stage that acts as a "data source" (where data is instructions). It doesn't try to model any hardware frontend concepts.

Its original name was "InstructionFormationStage". We ended up calling it "FetchStage"; in retrospect, it was not a good name because it causes ambiguity.
We may revert to that name if you like.

Now that llvm-mca is a library, people can define their own custom pipeline without having to modify the "default pipeline stages".
In particular, I don't want to introduce any frontend concepts in the default pipeline of llvm-mca.
For now, any frontend simulation should be implemented by stages that are not part of the default pipeline. The default pipeline should only stay focused on simulating the hardware backend logic.

I guess you meant to say: "the default pipeline should only stay focused on simulating what is modeled in the MCSchedModel" ? If we can carve out something that is common to all frontends, then it could end up in MCSchedModel, and then be in the default llvm-mca pipeline. (BTW the resource pointed to by Roman shows that the approach here might not be generic enough).
Until then it feels like the flag is a low-cost approach to implementing this.

We shouldn't make any changes to the current FetchStage.
Changes to other stages of the default pipeline are okay as long as we can demonstrate that those are beneficial for all the simulated processors.

Any other big change should require an RFC. The introduction of a new stage in the default pipeline should also be justified as it affects both simulation time, and potentially the quality of the analysis.

Essentially, I want to see an RFC on how your team wants to model the frontend simulation.
There are too many aspects that cannot be accurately modelled by a static analysis tool: branch prediction, loop buffers, different decoding paths, decoders with different capabilities, instruction byte windows, instruction decoder's queue,etc.
If we want to do that as part of the default pipeline, then we have to be extremely careful and do it right.

If we don't describe the hardware frontend correctly, we risk to pessimize the analysis rather than improve it. If we decide to add it to the default pipeline, then - at least to start - it should be opt-in for the targets (stages are not added unless the scheduling model for that subtarget explicitly ask for them).

About this patch:
the number of bytes fetched is not meaningful for the current "FetchStage".
The "FetchStage" doesn't/shouldn't care about how many bytes an instruction has. More importantly, our "intermediate form" is an already decoded instruction; the whole idea of checking how many bytes an instruction is at that stage is odd. We don't simulate the hardware frontend in the default pipeline (at least, not for now).
You need a separate stage for that. So, for now, sorry but I don't think it is a good compromise.

mattd added a subscriber: mattd.Oct 11 2018, 11:54 AM

In D53055#1262077, @andreadb wrote:

The "FetchStage" is literally just there to create instructions and move them to the next stage.

It is just an artificial entrypoint stage that acts as a "data source" (where data is instructions). It doesn't try to model any hardware frontend concepts.

Its original name was "InstructionFormationStage". We ended up calling it "FetchStage"; in retrospect, it was not a good name because it causes ambiguity.
We may revert to that name if you like.

Yes, I think that would make sense.

Now that llvm-mca is a library, people can define their own custom pipeline without having to modify the "default pipeline stages".
In particular, I don't want to introduce any frontend concepts in the default pipeline of llvm-mca.
For now, any frontend simulation should be implemented by stages that are not part of the default pipeline.

I don't have a particular opinion about whether this should be part of the default pipeline or not, but I think modeling the frontend is very important.
This article from a couple years ago analyzes the typical workloads on a google datacenter. While most of the stalls are from the backend, the frontend has a significant contribution:

"Front-end core stalls account for 15-30% of all pipeline slots, with many workloads showing 5-10% of cycles completely starved on instructions".

The authors found this trend to be increasing over time. The article shows that i-cache misses are a large part of these stalls, which is going to be hard to model statically.
However, we also found out that large computation kernels typically had frontend stalls in fetch&decode due to the large size of vector instructions (the Intel fetch window is 16 bytes). We had some nice wins based on llvm_sim, which does simulate the frontend. We've made two of these wins public
(https://github.com/webmproject/libwebp/commit/67748b41dbb21a43e88f2b6ddf6117f4338873a3, https://github.com/google/gemmlowp/pull/91).

I think we agreed during EuroLLVM last year that we should standardize on a single tool to avoid duplicating effort, and that that this tool should be llvm-mca. That means that over time llvm-mca needs to grow to support more use cases.

The default pipeline should only stay focused on simulating the hardware backend logic.

I guess you meant to say: "the default pipeline should only stay focused on simulating what is modeled in the MCSchedModel" ? If we can carve out something that is common to all frontends, then it could end up in MCSchedModel, and then be in the default llvm-mca pipeline. (BTW the resource pointed to by Roman shows that the approach here might not be generic enough).
Until then it feels like the flag is a low-cost approach to implementing this.

We shouldn't make any changes to the current FetchStage.
Changes to other stages of the default pipeline are okay as long as we can demonstrate that those are beneficial for all the simulated processors.

I think that's fair. How would you feel about moving the change in this patch into a separate stage ? The flag would then turn on adding the stage so that we can experiment with various CPUs. If that turns out to be useful, we can discuss adding fetch modeling to the MCSchedModel, which will then allow us to add this to the default pipeline in a principled way.

Any other big change should require an RFC. The introduction of a new stage in the default pipeline should also be justified as it affects both simulation time, and potentially the quality of the analysis.

Essentially, I want to see an RFC on how your team wants to model the frontend simulation.
There are too many aspects that cannot be accurately modelled by a static analysis tool: branch prediction, loop buffers, different decoding paths, decoders with different capabilities, instruction byte windows, instruction decoder's queue,etc.
If we want to do that as part of the default pipeline, then we have to be extremely careful and do it right.

If we don't describe the hardware frontend correctly, we risk to pessimize the analysis rather than improve it. If we decide to add it to the default pipeline, then - at least to start - it should be opt-in for the targets (stages are not added unless the scheduling model for that subtarget explicitly ask for them).

About this patch:
the number of bytes fetched is not meaningful for the current "FetchStage".
The "FetchStage" doesn't/shouldn't care about how many bytes an instruction has. More importantly, our "intermediate form" is an already decoded instruction; the whole idea of checking how many bytes an instruction is at that stage is odd. We don't simulate the hardware frontend in the default pipeline (at least, not for now).
You need a separate stage for that. So, for now, sorry but I don't think it is a good compromise.

In D53055#1263046, @courbet wrote:

In D53055#1262077, @andreadb wrote:

Now that llvm-mca is a library, people can define their own custom pipeline without having to modify the "default pipeline stages".
In particular, I don't want to introduce any frontend concepts in the default pipeline of llvm-mca.
For now, any frontend simulation should be implemented by stages that are not part of the default pipeline.

I don't have a particular opinion about whether this should be part of the default pipeline or not, but I think modeling the frontend is very important.
This article from a couple years ago analyzes the typical workloads on a google datacenter. While most of the stalls are from the backend, the frontend has a significant contribution:

I think that we are on the same page.
I have nothing against having frontend analysis: we want to be able to identify frontend bottlenecks.
My point was more about the "development process". I think we need to agree on a plan, and have a good roadmap. It is difficult to evaluate small incremental patches like this if we don't have a "vision". We should have at least an idea on what will be the next steps.

"Front-end core stalls account for 15-30% of all pipeline slots, with many workloads showing 5-10% of cycles completely starved on instructions".

The authors found this trend to be increasing over time. The article shows that i-cache misses are a large part of these stalls, which is going to be hard to model statically.
However, we also found out that large computation kernels typically had frontend stalls in fetch&decode due to the large size of vector instructions (the Intel fetch window is 16 bytes). We had some nice wins based on llvm_sim, which does simulate the frontend. We've made two of these wins public
(https://github.com/webmproject/libwebp/commit/67748b41dbb21a43e88f2b6ddf6117f4338873a3, https://github.com/google/gemmlowp/pull/91).

I think we agreed during EuroLLVM last year that we should standardize on a single tool to avoid duplicating effort, and that that this tool should be llvm-mca. That means that over time llvm-mca needs to grow to support more use cases.

The default pipeline should only stay focused on simulating the hardware backend logic.

I guess you meant to say: "the default pipeline should only stay focused on simulating what is modeled in the MCSchedModel" ? If we can carve out something that is common to all frontends, then it could end up in MCSchedModel, and then be in the default llvm-mca pipeline. (BTW the resource pointed to by Roman shows that the approach here might not be generic enough).
Until then it feels like the flag is a low-cost approach to implementing this.

We shouldn't make any changes to the current FetchStage.
Changes to other stages of the default pipeline are okay as long as we can demonstrate that those are beneficial for all the simulated processors.

I think that's fair. How would you feel about moving the change in this patch into a separate stage ? The flag would then turn on adding the stage so that we can experiment with various CPUs. If that turns out to be useful, we can discuss adding fetch modeling to the MCSchedModel, which will then allow us to add this to the default pipeline in a principled way.

I think that is the right way to go. It would unblock your work in the short term, and give us time to evaluate the quality of the new logic without affecting the default analysis pipeline.

This is pretty much what I was suggesting in my previous comment (i.e. have frontend logic/process defined by separate stages that runs before "DispatchStage"). The current FetchStage will be renamed (I would do that after the conference if it is not a problem...), and it would still be the first stage to run. New stages would be marked as "experimental" to start. So that those are opt-in for subtargets.

P.s.: the new stage should have the concept of cache line and alignment. So that we can experiment different alignment constraints for the input code block. My understanding is that this new Fetch stage models the interaction with an IC (instruction cache); processor models should be able to customize what portion of a cache line can be picked every cycle.

Note however that this new stage may not be always enabled if the processor implements a loop buffer.
For example, instructions may be picked from a loop buffer, and not use the legacy decoders path (where instructions are fetched from the IC first). The throughput from the loop buffer normally differs from the throughput from the decoders, and it may be subject to different constraints (i.e. not the size in bytes of an instruction).
So, I am curious to see how you plan to model those frontend aspects. We may want to have to separate simulations: one where we always assume the IC path; another where we assume instructions are always picked from a loop buffer.
In practice, the choice of whether opcodes are contributed by the legacy decoder's path or not depends on the feedback from the branch predictor, and the size of a code snippet. So, the question (not for this patch) is: how much we want to complicate the model? (that is why I was originally pushing for an RFC; I didn't mean to be annoying...). Should we care about modelling (at least a few) aspects of the branch predictor? We don't have to answer to these questions now; I just wanted to further clarify why I feel cautious when it comes to modelling the frontend logic.

-Andrea

In D53055#1263141, @andreadb wrote:

In D53055#1263046, @courbet wrote:

In D53055#1262077, @andreadb wrote:

Now that llvm-mca is a library, people can define their own custom pipeline without having to modify the "default pipeline stages".
In particular, I don't want to introduce any frontend concepts in the default pipeline of llvm-mca.
For now, any frontend simulation should be implemented by stages that are not part of the default pipeline.

I don't have a particular opinion about whether this should be part of the default pipeline or not, but I think modeling the frontend is very important.
This article from a couple years ago analyzes the typical workloads on a google datacenter. While most of the stalls are from the backend, the frontend has a significant contribution:

I think that we are on the same page.
I have nothing against having frontend analysis: we want to be able to identify frontend bottlenecks.
My point was more about the "development process". I think we need to agree on a plan, and have a good roadmap. It is difficult to evaluate small incremental patches like this if we don't have a "vision". We should have at least an idea on what will be the next steps.

"Front-end core stalls account for 15-30% of all pipeline slots, with many workloads showing 5-10% of cycles completely starved on instructions".

The authors found this trend to be increasing over time. The article shows that i-cache misses are a large part of these stalls, which is going to be hard to model statically.
However, we also found out that large computation kernels typically had frontend stalls in fetch&decode due to the large size of vector instructions (the Intel fetch window is 16 bytes). We had some nice wins based on llvm_sim, which does simulate the frontend. We've made two of these wins public
(https://github.com/webmproject/libwebp/commit/67748b41dbb21a43e88f2b6ddf6117f4338873a3, https://github.com/google/gemmlowp/pull/91).

I think we agreed during EuroLLVM last year that we should standardize on a single tool to avoid duplicating effort, and that that this tool should be llvm-mca. That means that over time llvm-mca needs to grow to support more use cases.

The default pipeline should only stay focused on simulating the hardware backend logic.

I guess you meant to say: "the default pipeline should only stay focused on simulating what is modeled in the MCSchedModel" ? If we can carve out something that is common to all frontends, then it could end up in MCSchedModel, and then be in the default llvm-mca pipeline. (BTW the resource pointed to by Roman shows that the approach here might not be generic enough).
Until then it feels like the flag is a low-cost approach to implementing this.

We shouldn't make any changes to the current FetchStage.
Changes to other stages of the default pipeline are okay as long as we can demonstrate that those are beneficial for all the simulated processors.

I think that's fair. How would you feel about moving the change in this patch into a separate stage ? The flag would then turn on adding the stage so that we can experiment with various CPUs. If that turns out to be useful, we can discuss adding fetch modeling to the MCSchedModel, which will then allow us to add this to the default pipeline in a principled way.

I think that is the right way to go. It would unblock your work in the short term, and give us time to evaluate the quality of the new logic without affecting the default analysis pipeline.

This is pretty much what I was suggesting in my previous comment (i.e. have frontend logic/process defined by separate stages that runs before "DispatchStage"). The current FetchStage will be renamed (I would do that after the conference if it is not a problem...), and it would still be the first stage to run. New stages would be marked as "experimental" to start. So that those are opt-in for subtargets.

P.s.: the new stage should have the concept of cache line and alignment. So that we can experiment different alignment constraints for the input code block. My understanding is that this new Fetch stage models the interaction with an IC (instruction cache); processor models should be able to customize what portion of a cache line can be picked every cycle.

Note however that this new stage may not be always enabled if the processor implements a loop buffer.
For example, instructions may be picked from a loop buffer, and not use the legacy decoders path (where instructions are fetched from the IC first). The throughput from the loop buffer normally differs from the throughput from the decoders, and it may be subject to different constraints (i.e. not the size in bytes of an instruction).

So, I am curious to see how you plan to model those frontend aspects. We may want to have to separate simulations: one where we always assume the IC path; another where we assume instructions are always picked from a loop buffer.
In practice, the choice of whether opcodes are contributed by the legacy decoder's path or not depends on the feedback from the branch predictor, and the size of a code snippet. So, the question (not for this patch) is: how much we want to complicate the model? (that is why I was originally pushing for an RFC; I didn't mean to be annoying...). Should we care about modelling (at least a few) aspects of the branch predictor? We don't have to answer to these questions now; I just wanted to further clarify why I feel cautious when it comes to modelling the frontend logic.

I fully agree: The approach we took in llvm_sim is to tell the simulator whether we're in a loop or not with a flag. In our implementation we always went through the legacy decoder because our goal was to improve scheduling of large blocks (because that's where rescheduling really makes a difference). And that's something that we could generalize upon: We can build a pipeline depending on the structure of the input (an interesting read BTW: https://stackoverflow.com/questions/39311872/is-performance-reduced-when-executing-loops-whose-uop-count-is-not-a-multiple-of).

RKSimon resigned from this revision.Mar 28 2021, 4:31 AM

Herald added a project: Restricted Project. · View Herald TranscriptMar 28 2021, 4:31 AM

Revision Contents

Path

Size

llvm/

tools/

llvm-mca/

include/

Context.h

16 lines

Stages/

FetchStage.h

16 lines

lib/

Context.cpp

3 lines

Stages/

FetchStage.cpp

33 lines

llvm-mca.cpp

23 lines

Diff 168914

llvm/tools/llvm-mca/include/Context.h

	Show All 17 Lines
	#ifndef LLVM_TOOLS_LLVM_MCA_CONTEXT_H			#ifndef LLVM_TOOLS_LLVM_MCA_CONTEXT_H
	#define LLVM_TOOLS_LLVM_MCA_CONTEXT_H			#define LLVM_TOOLS_LLVM_MCA_CONTEXT_H
	#include "HardwareUnits/HardwareUnit.h"			#include "HardwareUnits/HardwareUnit.h"
	#include "InstrBuilder.h"			#include "InstrBuilder.h"
	#include "Pipeline.h"			#include "Pipeline.h"
	#include "SourceMgr.h"			#include "SourceMgr.h"
	#include "llvm/MC/MCRegisterInfo.h"			#include "llvm/MC/MCRegisterInfo.h"
	#include "llvm/MC/MCSubtargetInfo.h"			#include "llvm/MC/MCSubtargetInfo.h"
				#include "llvm/MC/MCCodeEmitter.h"
	#include <memory>			#include <memory>

	namespace mca {			namespace mca {

	/// This is a convenience struct to hold the parameters necessary for creating			/// This is a convenience struct to hold the parameters necessary for creating
	/// the pre-built "default" out-of-order pipeline.			/// the pre-built "default" out-of-order pipeline.
	struct PipelineOptions {			struct PipelineOptions {
	PipelineOptions(unsigned DW, unsigned RFS, unsigned LQS, unsigned SQS,			PipelineOptions(unsigned MBPC, unsigned DW, unsigned RFS, unsigned LQS,
	bool NoAlias)			unsigned SQS, bool NoAlias)
	: DispatchWidth(DW), RegisterFileSize(RFS), LoadQueueSize(LQS),			: MaxBytesFetchedPerCycle(MBPC), DispatchWidth(DW), RegisterFileSize(RFS),
	StoreQueueSize(SQS), AssumeNoAlias(NoAlias) {}			LoadQueueSize(LQS), StoreQueueSize(SQS), AssumeNoAlias(NoAlias) {}
				unsigned MaxBytesFetchedPerCycle;
	unsigned DispatchWidth;			unsigned DispatchWidth;
	unsigned RegisterFileSize;			unsigned RegisterFileSize;
	unsigned LoadQueueSize;			unsigned LoadQueueSize;
	unsigned StoreQueueSize;			unsigned StoreQueueSize;
	bool AssumeNoAlias;			bool AssumeNoAlias;
	};			};

	class Context {			class Context {
	llvm::SmallVector<std::unique_ptr<HardwareUnit>, 4> Hardware;			llvm::SmallVector<std::unique_ptr<HardwareUnit>, 4> Hardware;
	const llvm::MCRegisterInfo &MRI;			const llvm::MCRegisterInfo &MRI;
	const llvm::MCSubtargetInfo &STI;			const llvm::MCSubtargetInfo &STI;
				const llvm::MCCodeEmitter &MCE;

	public:			public:
	Context(const llvm::MCRegisterInfo &R, const llvm::MCSubtargetInfo &S)			Context(const llvm::MCRegisterInfo &R, const llvm::MCSubtargetInfo &S,
	: MRI(R), STI(S) {}			const llvm::MCCodeEmitter &MCE)
				: MRI(R), STI(S), MCE(MCE) {}
	Context(const Context &C) = delete;			Context(const Context &C) = delete;
	Context &operator=(const Context &C) = delete;			Context &operator=(const Context &C) = delete;

	void addHardwareUnit(std::unique_ptr<HardwareUnit> H) {			void addHardwareUnit(std::unique_ptr<HardwareUnit> H) {
	Hardware.push_back(std::move(H));			Hardware.push_back(std::move(H));
	}			}

	/// Construct a basic pipeline for simulating an out-of-order pipeline.			/// Construct a basic pipeline for simulating an out-of-order pipeline.
	/// This pipeline consists of Fetch, Dispatch, Execute, and Retire stages.			/// This pipeline consists of Fetch, Dispatch, Execute, and Retire stages.
	std::unique_ptr<Pipeline> createDefaultPipeline(const PipelineOptions &Opts,			std::unique_ptr<Pipeline> createDefaultPipeline(const PipelineOptions &Opts,
	InstrBuilder &IB,			InstrBuilder &IB,
	SourceMgr &SrcMgr);			SourceMgr &SrcMgr);
	};			};

	} // namespace mca			} // namespace mca
	#endif // LLVM_TOOLS_LLVM_MCA_CONTEXT_H			#endif // LLVM_TOOLS_LLVM_MCA_CONTEXT_H

llvm/tools/llvm-mca/include/Stages/FetchStage.h

	Show All 13 Lines
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef LLVM_TOOLS_LLVM_MCA_FETCH_STAGE_H			#ifndef LLVM_TOOLS_LLVM_MCA_FETCH_STAGE_H
	#define LLVM_TOOLS_LLVM_MCA_FETCH_STAGE_H			#define LLVM_TOOLS_LLVM_MCA_FETCH_STAGE_H

	#include "InstrBuilder.h"			#include "InstrBuilder.h"
	#include "SourceMgr.h"			#include "SourceMgr.h"
	#include "Stages/Stage.h"			#include "Stages/Stage.h"
				#include "llvm/MC/MCCodeEmitter.h"
	#include <map>			#include <map>

	namespace mca {			namespace mca {

	class FetchStage final : public Stage {			class FetchStage final : public Stage {
	std::unique_ptr<Instruction> CurrentInstruction;			std::unique_ptr<Instruction> CurrentInstruction;
	using InstMap = std::map<unsigned, std::unique_ptr<Instruction>>;			using InstMap = std::map<unsigned, std::unique_ptr<Instruction>>;
	InstMap Instructions;			InstMap Instructions;
	InstrBuilder &IB;			InstrBuilder &IB;
	SourceMgr &SM;			SourceMgr &SM;
				const llvm::MCSubtargetInfo &STI;
				const llvm::MCCodeEmitter& MCE;
				int MaxBytesFetchedPerCycle;
				int BytesFetchedThisCycle;

	// Updates the program counter, and sets 'CurrentInstruction'.			// Updates the program counter, and sets 'CurrentInstruction'.
	llvm::Error getNextInstruction();			llvm::Error getNextInstruction();

				// Determines the length that an instruction has when encoded.
				int getEncodedLength(const llvm::MCInst& MCI);

	FetchStage(const FetchStage &Other) = delete;			FetchStage(const FetchStage &Other) = delete;
	FetchStage &operator=(const FetchStage &Other) = delete;			FetchStage &operator=(const FetchStage &Other) = delete;

	public:			public:
	FetchStage(InstrBuilder &IB, SourceMgr &SM)			FetchStage(InstrBuilder &IB, SourceMgr &SM,
	: CurrentInstruction(), IB(IB), SM(SM) {}			const llvm::MCSubtargetInfo &STI,
				const llvm::MCCodeEmitter& MCE,
				int MaxBytesFetchedPerCycle)
				: CurrentInstruction(), IB(IB), SM(SM), STI(STI), MCE(MCE),
				MaxBytesFetchedPerCycle(MaxBytesFetchedPerCycle) {}

	bool isAvailable(const InstRef &IR) const override;			bool isAvailable(const InstRef &IR) const override;
	bool hasWorkToComplete() const override;			bool hasWorkToComplete() const override;
	llvm::Error execute(InstRef &IR) override;			llvm::Error execute(InstRef &IR) override;
	llvm::Error cycleStart() override;			llvm::Error cycleStart() override;
	llvm::Error cycleEnd() override;			llvm::Error cycleEnd() override;
	};			};

	} // namespace mca			} // namespace mca

	#endif // LLVM_TOOLS_LLVM_MCA_FETCH_STAGE_H			#endif // LLVM_TOOLS_LLVM_MCA_FETCH_STAGE_H

llvm/tools/llvm-mca/lib/Context.cpp

Show All 35 Lines	Context::createDefaultPipeline(const PipelineOptions &Opts, InstrBuilder &IB,
// Create the hardware units defining the backend.		// Create the hardware units defining the backend.
auto RCU = llvm::make_unique<RetireControlUnit>(SM);		auto RCU = llvm::make_unique<RetireControlUnit>(SM);
auto PRF = llvm::make_unique<RegisterFile>(SM, MRI, Opts.RegisterFileSize);		auto PRF = llvm::make_unique<RegisterFile>(SM, MRI, Opts.RegisterFileSize);
auto LSU = llvm::make_unique<LSUnit>(Opts.LoadQueueSize, Opts.StoreQueueSize,		auto LSU = llvm::make_unique<LSUnit>(Opts.LoadQueueSize, Opts.StoreQueueSize,
Opts.AssumeNoAlias);		Opts.AssumeNoAlias);
auto HWS = llvm::make_unique<Scheduler>(SM, LSU.get());		auto HWS = llvm::make_unique<Scheduler>(SM, LSU.get());

// Create the pipeline stages.		// Create the pipeline stages.
auto Fetch = llvm::make_unique<FetchStage>(IB, SrcMgr);		auto Fetch = llvm::make_unique<FetchStage>(IB, SrcMgr, STI, MCE,
		Opts.MaxBytesFetchedPerCycle);
auto Dispatch = llvm::make_unique<DispatchStage>(STI, MRI, Opts.DispatchWidth,		auto Dispatch = llvm::make_unique<DispatchStage>(STI, MRI, Opts.DispatchWidth,
RCU, PRF);		RCU, PRF);
auto Execute = llvm::make_unique<ExecuteStage>(*HWS);		auto Execute = llvm::make_unique<ExecuteStage>(*HWS);
auto Retire = llvm::make_unique<RetireStage>(RCU, PRF);		auto Retire = llvm::make_unique<RetireStage>(RCU, PRF);

// Pass the ownership of all the hardware units to this Context.		// Pass the ownership of all the hardware units to this Context.
addHardwareUnit(std::move(RCU));		addHardwareUnit(std::move(RCU));
addHardwareUnit(std::move(PRF));		addHardwareUnit(std::move(PRF));
Show All 13 Lines

llvm/tools/llvm-mca/lib/Stages/FetchStage.cpp

	//===---------------------- FetchStage.cpp ----------------------- C++ --===//			//===---------------------- FetchStage.cpp ----------------------- C++ --===//
	//			//
	// The LLVM Compiler Infrastructure			// The LLVM Compiler Infrastructure
	//			//
	// This file is distributed under the University of Illinois Open Source			// This file is distributed under the University of Illinois Open Source
	// License. See LICENSE.TXT for details.			// License. See LICENSE.TXT for details.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	/// \file			/// \file
	///			///
	/// This file defines the Fetch stage of an instruction pipeline. Its sole			/// This file defines the Fetch stage of an instruction pipeline. Its sole
	/// purpose in life is to produce instructions for the rest of the pipeline.			/// purpose in life is to produce instructions for the rest of the pipeline.
	///			///
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "Stages/FetchStage.h"			#include "Stages/FetchStage.h"
				#include "llvm/MC/MCFixup.h"

	namespace mca {			namespace mca {

	bool FetchStage::hasWorkToComplete() const {			bool FetchStage::hasWorkToComplete() const {
	return CurrentInstruction.get() \|\| SM.hasNext();			return CurrentInstruction.get() \|\| SM.hasNext();
	}			}

	bool FetchStage::isAvailable(const InstRef & /* unused */) const {			bool FetchStage::isAvailable(const InstRef & /* unused */) const {
	if (!CurrentInstruction)			if (!CurrentInstruction)
	return false;			return false;
	assert(SM.hasNext() && "Unexpected internal state!");			assert(SM.hasNext() && "Unexpected internal state!");
	const SourceRef SR = SM.peekNext();			const SourceRef SR = SM.peekNext();
	InstRef IR(SR.first, CurrentInstruction.get());			InstRef IR(SR.first, CurrentInstruction.get());
	return checkNextStage(IR);			return checkNextStage(IR);
	}			}

	llvm::Error FetchStage::getNextInstruction() {			llvm::Error FetchStage::getNextInstruction() {
	assert(!CurrentInstruction && "There is already an instruction to process!");			assert(!CurrentInstruction && "There is already an instruction to process!");
	if (!SM.hasNext())			if (!SM.hasNext())
	return llvm::ErrorSuccess();			return llvm::ErrorSuccess();
	const SourceRef SR = SM.peekNext();			const SourceRef SR = SM.peekNext();

				// Limit the number of bytes that can be fetched in one cycle. For example,
				// see the Intel Optimization Reference Manual section 2.5.2.2 ("Instruction
				// Fetch Unit"), which documents a 16 byte limit per fetch.
				if (MaxBytesFetchedPerCycle) {
				int EncodedLength = getEncodedLength(*SR.second);
				assert((EncodedLength <= MaxBytesFetchedPerCycle)
				&& "Instruction larger than maximum fetch size!");

				BytesFetchedThisCycle += EncodedLength;
				if (BytesFetchedThisCycle > MaxBytesFetchedPerCycle)
				return llvm::ErrorSuccess();
				}

	llvm::Expected<std::unique_ptr<Instruction>> InstOrErr =			llvm::Expected<std::unique_ptr<Instruction>> InstOrErr =
	IB.createInstruction(*SR.second);			IB.createInstruction(*SR.second);
	if (!InstOrErr)			if (!InstOrErr)
	return InstOrErr.takeError();			return InstOrErr.takeError();
	CurrentInstruction = std::move(InstOrErr.get());			CurrentInstruction = std::move(InstOrErr.get());
	return llvm::ErrorSuccess();			return llvm::ErrorSuccess();
	}			}

				class length_counting_ostream : public llvm::raw_ostream {
				public:
				int CurrentPos = 0;

				length_counting_ostream() : llvm::raw_ostream(/Unbuffered=/ true) {}

				void write_impl(const char*, size_t Size) override { CurrentPos += Size; }
				uint64_t current_pos() const override { return CurrentPos; }
				};

				int FetchStage::getEncodedLength(const llvm::MCInst& MCI) {
				length_counting_ostream LCOS;
				llvm::SmallVector<llvm::MCFixup, 1> Fixups;
				MCE.encodeInstruction(MCI, LCOS, Fixups, STI);
				return LCOS.current_pos();
				}

	llvm::Error FetchStage::execute(InstRef & /unused /) {			llvm::Error FetchStage::execute(InstRef & /unused /) {
	assert(CurrentInstruction && "There is no instruction to process!");			assert(CurrentInstruction && "There is no instruction to process!");
	const SourceRef SR = SM.peekNext();			const SourceRef SR = SM.peekNext();
	InstRef IR(SR.first, CurrentInstruction.get());			InstRef IR(SR.first, CurrentInstruction.get());
	assert(checkNextStage(IR) && "Invalid fetch!");			assert(checkNextStage(IR) && "Invalid fetch!");

	Instructions[IR.getSourceIndex()] = std::move(CurrentInstruction);			Instructions[IR.getSourceIndex()] = std::move(CurrentInstruction);
	if (llvm::Error Val = moveToTheNextStage(IR))			if (llvm::Error Val = moveToTheNextStage(IR))
	return Val;			return Val;

	SM.updateNext();			SM.updateNext();

	// Move the program counter.			// Move the program counter.
	return getNextInstruction();			return getNextInstruction();
	}			}

	llvm::Error FetchStage::cycleStart() {			llvm::Error FetchStage::cycleStart() {
				BytesFetchedThisCycle = 0;
	if (!CurrentInstruction)			if (!CurrentInstruction)
	return getNextInstruction();			return getNextInstruction();
	return llvm::ErrorSuccess();			return llvm::ErrorSuccess();
	}			}

	llvm::Error FetchStage::cycleEnd() {			llvm::Error FetchStage::cycleEnd() {
	// Find the first instruction which hasn't been retired.			// Find the first instruction which hasn't been retired.
	const InstMap::iterator It =			const InstMap::iterator It =
	Show All 12 Lines

llvm/tools/llvm-mca/llvm-mca.cpp

Show First 20 Lines • Show All 86 Lines • ▼ Show 20 Lines	OutputAsmVariant("output-asm-variant",
cl::desc("Syntax variant to use for output printing"),		cl::desc("Syntax variant to use for output printing"),
cl::cat(ToolOptions), cl::init(-1));		cl::cat(ToolOptions), cl::init(-1));

static cl::opt<unsigned> Iterations("iterations",		static cl::opt<unsigned> Iterations("iterations",
cl::desc("Number of iterations to run"),		cl::desc("Number of iterations to run"),
cl::cat(ToolOptions), cl::init(0));		cl::cat(ToolOptions), cl::init(0));

static cl::opt<unsigned>		static cl::opt<unsigned>
		MaxBytesFetchedPerCycle("bytes-per-cycle",
		cl::desc("The maximum number of bytes worth of "
		"instructions that can be fetched in one "
		"cycle."),
		cl::cat(ToolOptions), cl::init(0));

		static cl::opt<unsigned>
DispatchWidth("dispatch", cl::desc("Override the processor dispatch width"),		DispatchWidth("dispatch", cl::desc("Override the processor dispatch width"),
cl::cat(ToolOptions), cl::init(0));		cl::cat(ToolOptions), cl::init(0));

static cl::opt<unsigned>		static cl::opt<unsigned>
RegisterFileSize("register-file-size",		RegisterFileSize("register-file-size",
cl::desc("Maximum number of physical registers which can "		cl::desc("Maximum number of physical registers which can "
"be used for register mappings"),		"be used for register mappings"),
cl::cat(ToolOptions), cl::init(0));		cl::cat(ToolOptions), cl::init(0));
▲ Show 20 Lines • Show All 263 Lines • ▼ Show 20 Lines	int main(int argc, char **argv) {
// Apply overrides to llvm-mca specific options.		// Apply overrides to llvm-mca specific options.
processViewOptions();		processViewOptions();

SourceMgr SrcMgr;		SourceMgr SrcMgr;

// Tell SrcMgr about this buffer, which is what the parser will pick up.		// Tell SrcMgr about this buffer, which is what the parser will pick up.
SrcMgr.AddNewSourceBuffer(std::move(*BufferPtr), SMLoc());		SrcMgr.AddNewSourceBuffer(std::move(*BufferPtr), SMLoc());

		std::unique_ptr<MCInstrInfo> MCII(TheTarget->createMCInstrInfo());

std::unique_ptr<MCRegisterInfo> MRI(TheTarget->createMCRegInfo(TripleName));		std::unique_ptr<MCRegisterInfo> MRI(TheTarget->createMCRegInfo(TripleName));
assert(MRI && "Unable to create target register info!");		assert(MRI && "Unable to create target register info!");

std::unique_ptr<MCAsmInfo> MAI(TheTarget->createMCAsmInfo(*MRI, TripleName));		std::unique_ptr<MCAsmInfo> MAI(TheTarget->createMCAsmInfo(*MRI, TripleName));
assert(MAI && "Unable to create target asm info!");		assert(MAI && "Unable to create target asm info!");

MCObjectFileInfo MOFI;		MCObjectFileInfo MOFI;
MCContext Ctx(MAI.get(), MRI.get(), &MOFI, &SrcMgr);		MCContext Ctx(MAI.get(), MRI.get(), &MOFI, &SrcMgr);

		MCCodeEmitter* MCE = TheTarget->createMCCodeEmitter(MCII, MRI, Ctx);

MOFI.InitMCObjectFileInfo(TheTriple, /* PIC= */ false, Ctx);		MOFI.InitMCObjectFileInfo(TheTriple, /* PIC= */ false, Ctx);

std::unique_ptr<buffer_ostream> BOS;		std::unique_ptr<buffer_ostream> BOS;

mca::CodeRegions Regions(SrcMgr);		mca::CodeRegions Regions(SrcMgr);
MCStreamerWrapper Str(Ctx, Regions);		MCStreamerWrapper Str(Ctx, Regions);

std::unique_ptr<MCInstrInfo> MCII(TheTarget->createMCInstrInfo());

std::unique_ptr<MCInstrAnalysis> MCIA(		std::unique_ptr<MCInstrAnalysis> MCIA(
TheTarget->createMCInstrAnalysis(MCII.get()));		TheTarget->createMCInstrAnalysis(MCII.get()));

if (!MCPU.compare("native"))		if (!MCPU.compare("native"))
MCPU = llvm::sys::getHostCPUName();		MCPU = llvm::sys::getHostCPUName();

std::unique_ptr<MCSubtargetInfo> STI(		std::unique_ptr<MCSubtargetInfo> STI(
TheTarget->createMCSubtargetInfo(TripleName, MCPU, /* FeaturesStr */ ""));		TheTarget->createMCSubtargetInfo(TripleName, MCPU, /* FeaturesStr */ ""));
▲ Show 20 Lines • Show All 59 Lines • ▼ Show 20 Lines	int main(int argc, char **argv) {
unsigned Width = SM.IssueWidth;		unsigned Width = SM.IssueWidth;
if (DispatchWidth)		if (DispatchWidth)
Width = DispatchWidth;		Width = DispatchWidth;

// Create an instruction builder.		// Create an instruction builder.
mca::InstrBuilder IB(STI, MCII, MRI, MCIA, *IP);		mca::InstrBuilder IB(STI, MCII, MRI, MCIA, *IP);

// Create a context to control ownership of the pipeline hardware.		// Create a context to control ownership of the pipeline hardware.
mca::Context MCA(MRI, STI);		mca::Context MCA(MRI, STI, *MCE);

mca::PipelineOptions PO(Width, RegisterFileSize, LoadQueueSize,		mca::PipelineOptions PO(MaxBytesFetchedPerCycle, Width, RegisterFileSize,
StoreQueueSize, AssumeNoAlias);		LoadQueueSize, StoreQueueSize, AssumeNoAlias);

// Number each region in the sequence.		// Number each region in the sequence.
unsigned RegionIdx = 0;		unsigned RegionIdx = 0;
for (const std::unique_ptr<mca::CodeRegion> &Region : Regions) {		for (const std::unique_ptr<mca::CodeRegion> &Region : Regions) {
// Skip empty code regions.		// Skip empty code regions.
if (Region->empty())		if (Region->empty())
continue;		continue;

// Don't print the header of this region if it is the default region, and		// Don't print the header of this region if it is the default region, and
// it doesn't have an end location.		// it doesn't have an end location.
if (Region->startLoc().isValid() \|\| Region->endLoc().isValid()) {		if (Region->startLoc().isValid() \|\| Region->endLoc().isValid()) {
TOF->os() << "\n[" << RegionIdx++ << "] Code Region";		TOF->os() << "\n[" << RegionIdx++ << "] Code Region";
StringRef Desc = Region->getDescription();		StringRef Desc = Region->getDescription();
if (!Desc.empty())		if (!Desc.empty())
TOF->os() << " - " << Desc;		TOF->os() << " - " << Desc;
TOF->os() << "\n\n";		TOF->os() << "\n\n";
}		}

mca::SourceMgr S(Region->getInstructions(),		mca::SourceMgr S(Region->getInstructions(),
PrintInstructionTables ? 1 : Iterations);		PrintInstructionTables ? 1 : Iterations);

if (PrintInstructionTables) {		if (PrintInstructionTables) {
// Create a pipeline, stages, and a printer.		// Create a pipeline, stages, and a printer.
auto P = llvm::make_unique<mca::Pipeline>();		auto P = llvm::make_unique<mca::Pipeline>();
P->appendStage(llvm::make_unique<mca::FetchStage>(IB, S));		P->appendStage(llvm::make_unique<mca::FetchStage>(
		IB, S, STI, MCE, MaxBytesFetchedPerCycle));
P->appendStage(llvm::make_unique<mca::InstructionTables>(SM, IB));		P->appendStage(llvm::make_unique<mca::InstructionTables>(SM, IB));
mca::PipelinePrinter Printer(*P);		mca::PipelinePrinter Printer(*P);

// Create the views for this pipeline, execute, and emit a report.		// Create the views for this pipeline, execute, and emit a report.
if (PrintInstructionInfoView) {		if (PrintInstructionInfoView) {
Printer.addView(		Printer.addView(
llvm::make_unique<mca::InstructionInfoView>(STI, MCII, S, *IP));		llvm::make_unique<mca::InstructionInfoView>(STI, MCII, S, *IP));
}		}
▲ Show 20 Lines • Show All 53 Lines • Show Last 20 Lines