This is an archive of the discontinued LLVM Phabricator instance.

[MCA] Limit the number of bytes fetched per cycle.
Needs Review · Public

Authored by orodley on Oct 9 2018, 5:06 PM.

Details

Summary

For an example of this limit in real hardware, see the Intel Optimization
Reference Manual, Section 2.5.2.2 ("Instruction Fetch Unit"):

An instruction fetch is a 16-byte aligned lookup through the ITLB into
the instruction cache and instruction prefetch buffers. A hit in the
instruction cache causes 16 bytes to be delivered to the instruction
predecoder.

The model here (fetching exactly 16 bytes every cycle) is incomplete, as
it assumes every fetch is aligned and hits the instruction cache, but it's
an improvement.

This is usually not a bottleneck, as stalls elsewhere will hide it, but
it can make a difference in some cases. Further work would be to
classify these cases in the report output - if there are unutilised
issue slots while there is no backend stall, we are frontend bound.
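
To make the idea concrete (an illustrative sketch only, not the actual diff; the class and method names below are made up):

  // Hypothetical per-cycle fetch byte budget; not llvm-mca code.
  class FetchByteBudget {
    static constexpr unsigned BytesPerCycle = 16; // e.g. the Intel fetch window
    unsigned RemainingBytes = BytesPerCycle;

  public:
    // Refill the budget at the start of every simulated cycle.
    void onCycleStart() { RemainingBytes = BytesPerCycle; }

    // Deliver an instruction of the given encoded size if it still fits in
    // this cycle's window; otherwise report a frontend stall.
    bool deliver(unsigned InstrSizeInBytes) {
      if (InstrSizeInBytes > RemainingBytes)
        return false; // stall until the next cycle refills the budget
      RemainingBytes -= InstrSizeInBytes;
      return true;
    }
  };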

Diff Detail

Event Timeline

orodley created this revision. Oct 9 2018, 5:06 PM
orodley retitled this revision from [MCA] Constrain the number of bytes fetched per cycle to [MCA] Limit the number of bytes fetched per cycle. Oct 9 2018, 7:15 PM

This should eventually be configurable in the target's scheduling model, right?
Also, a single fetch 'queue' may not fit every processor.
https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf

While previous AMD64 family 15h processors had a single 32-byte fetch window, AMD Family 15h,
models 30h–4Fh processors have two 32-byte fetch windows, from which three micro-ops can be
selected.

andreadb added a subscriber: courbet.

Hi Owen,

The default pipeline in llvm-mca doesn't simulate any hardware frontend logic.

The Fetch Stage in llvm-mca is only responsible for creating instructions and moving them to the next pipeline stage.
It shouldn't be confused with the Fetch logic in the hardware frontend, which - as you wrote - is responsible for fetching portions of a cache line every cycle and feeding them to the decoders via an instruction byte queue.

The llvm-mca Fetch stage is equivalent to an unbounded queue of already decoded instructions. Instructions from every iteration are immediately available at cycle 0.

About the topic of simulating the hardware frontend logic:

There is already bug https://bugs.llvm.org/show_bug.cgi?id=36665, which is about adding support for simulating the hardware frontend logic.
I know that @courbet and his team would like to work on it. So, you can probably try to work with them on this.
Unfortunately, that bugzilla needs to be updated. There is not enough information there (I suggested sending a detailed RFC upstream if needed).

I strongly suggest that you, your team, and Clement's team work together on that task. I am afraid that people may be working on the same tasks in parallel; that has to be avoided.
You can use that bugzilla to coordinate your work upstream on this.

Now that llvm-mca is a library, people can define their own custom pipeline without having to modify the "default pipeline stages".
In particular, I don't want to introduce any frontend concepts in the default pipeline of llvm-mca.
For now, any frontend simulation should be implemented by stages that are not part of the default pipeline. The default pipeline should only stay focused on simulating the hardware backend logic.
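
To illustrate how that could look (a simplified sketch: the mca::Stage hooks and helper names below are assumptions and may not match the actual headers of any given LLVM revision), a frontend-only stage could be written and appended to a custom pipeline without touching the default one:

  // Hypothetical stage that throttles instructions to a per-cycle byte window.
  // Only the hooks relevant to the idea are shown; the rest of the Stage
  // interface and proper error handling are omitted. getEncodedSize() is a
  // made-up helper returning an instruction's encoded size in bytes.
  class FetchWindowStage : public llvm::mca::Stage {
    unsigned WindowBytes;
    unsigned BytesLeft;

  public:
    explicit FetchWindowStage(unsigned WindowBytes)
        : WindowBytes(WindowBytes), BytesLeft(WindowBytes) {}

    llvm::Error cycleStart() override {
      BytesLeft = WindowBytes; // a fresh fetch window every cycle
      return llvm::ErrorSuccess();
    }

    bool isAvailable(const llvm::mca::InstRef &IR) const override {
      // Refuse instructions that do not fit in what is left of this window.
      return getEncodedSize(IR) <= BytesLeft;
    }

    llvm::Error execute(llvm::mca::InstRef &IR) override {
      BytesLeft -= getEncodedSize(IR);
      return moveToTheNextStage(IR); // hand the instruction downstream
    }
  };

  // A custom pipeline would then place it in front of the backend stages, e.g.:
  //   Pipeline->appendStage(std::make_unique<FetchWindowStage>(16));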

In future, Subtargets will be able to declare custom pipelines via tablegen.
The default pipeline would only be used by subtargets that don't provide/specify their own custom pipeline.
That being said, this is a long term goal, and it has to be properly discussed in an RFC. It will also affect the future of PR36665.

-Andrea

Hi Andrea,

There is already bug https://bugs.llvm.org/show_bug.cgi?id=36665, which is about adding support for simulating the hardware frontend logic.
I know that @courbet and his team would like to work on it. So, you can probably try to work with them on this.
Unfortunately, that bugzilla needs to be updated. There is not enough information there (I suggested sending a detailed RFC upstream if needed).

I strongly suggest that you, your team, and Clement's team work together on that task. I am afraid that people may be working on the same tasks in parallel; that has to be avoided.
You can use that bugzilla to coordinate your work upstream on this.

Let me clarify this: Owen is working with us :) He has taken over the genetic scheduler work I presented at EuroLLVM. One of the bottlenecks we had was the frontend, hence the change. I agree that this should have been made clearer (@owenrodley, can you create a bugzilla account and assign the bug to yourself?)

The default pipeline in llvm-mca doesn't simulate any hardware frontend logic.

The Fetch Stage in llvm-mca is only responsible for creating instructions and moving them to the next pipeline stage.
It shouldn't be confused with the Fetch logic in the hardware frontend, which - as you wrote - is responsible for fetching portions of a cache line every cycle and feeding them to the decoders via an instruction byte queue.

The llvm-mca Fetch stage is equivalent to an unbounded queue of already decoded instructions. Instructions from every iteration are immediately available at cycle 0.

All of this sounds more like a naming issue than an issue about what Owen is trying to implement.
Maybe we could rename FetchStage into DecodedStage or something like this?

Now that llvm-mca is a library, people can define their own custom pipeline without having to modify the "default pipeline stages".
In particular, I don't want to introduce any frontend concepts in the default pipeline of llvm-mca.
For now, any frontend simulation should be implemented by stages that are not part of the default pipeline. The default pipeline should only stay focused on simulating the hardware backend logic.

I guess you meant to say: "the default pipeline should only stay focused on simulating what is modeled in the MCSchedModel" ? If we can carve out something that is common to all frontends, then it could end up in MCSchedModel, and then be in the default llvm-mca pipeline. (BTW the resource pointed to by Roman shows that the approach here might not be generic enough).
Until then it feels like the flag is a low-cost approach to implementing this.

Hi Andrea,

There is already bug https://bugs.llvm.org/show_bug.cgi?id=36665, which is about adding support for simulating the hardware frontend logic.
I know that @courbet and his team would like to work on it. So, you can probably try to work with them on this.
Unfortunately, that bugzilla needs to be updated. There is not enough information there (I suggested sending a detailed RFC upstream if needed).

I strongly suggest that you, your team, and Clement's team work together on that task. I am afraid that people may be working on the same tasks in parallel; that has to be avoided.
You can use that bugzilla to coordinate your work upstream on this.

Let me clarify this: Owen is working with us :) He has taken over the genetic scheduler work I presented at EuroLLVM. One of the bottlenecks we had was the frontend, hence the change. I agree that this should have been made clearer (@owenrodley, can you create a bugzilla account and assign the bug to yourself?)

Okay. Good to know that there is no overlap :-).

The default pipeline in llvm-mca doesn't simulate any hardware frontend logic.

The Fetch Stage in llvm-mca is only responsible for creating instructions and moving them to the next pipeline stage.
It shouldn't be confused with the Fetch logic in the hardware frontend, which - as you wrote - is responsible for fetching portions of a cache line every cycle and feeding them to the decoders via an instruction byte queue.

The llvm-mca Fetch stage is equivalent to an unbounded queue of already decoded instructions. Instructions from every iteration are immediately available at cycle 0.

All of this sounds more like a naming issue than an issue about what Owen is trying to implement.
Maybe we could rename FetchStage into DecodedStage or something like this?

The "FetchStage" is literally just there to create instructions and move them to the next stage.

It is just an artificial entrypoint stage that acts as a "data source" (where data is instructions). It doesn't try to model any hardware frontend concepts.

Its original name was "InstructionFormationStage". We ended up calling it "FetchStage"; in retrospect, it was not a good name because it causes ambiguity.
We may revert to that name if you like.

Now that llvm-mca is a library, people can define their own custom pipeline without having to modify the "default pipeline stages".
In particular, I don't want to introduce any frontend concepts in the default pipeline of llvm-mca.
For now, any frontend simulation should be implemented by stages that are not part of the default pipeline. The default pipeline should only stay focused on simulating the hardware backend logic.

I guess you meant to say: "the default pipeline should only stay focused on simulating what is modeled in the MCSchedModel" ? If we can carve out something that is common to all frontends, then it could end up in MCSchedModel, and then be in the default llvm-mca pipeline. (BTW the resource pointed to by Roman shows that the approach here might not be generic enough).
Until then it feels like the flag is a low-cost approach to implementing this.

We shouldn't make any changes to the current FetchStage.
Changes to other stages of the default pipeline are okay as long as we can demonstrate that those are beneficial for all the simulated processors.

Any other big change should require an RFC. The introduction of a new stage in the default pipeline should also be justified as it affects both simulation time, and potentially the quality of the analysis.

Essentially, I want to see an RFC on how your team wants to model the frontend simulation.
There are too many aspects that cannot be accurately modelled by a static analysis tool: branch prediction, loop buffers, different decoding paths, decoders with different capabilities, instruction byte windows, the instruction decoder's queue, etc.
If we want to do that as part of the default pipeline, then we have to be extremely careful and do it right.

If we don't describe the hardware frontend correctly, we risk pessimizing the analysis rather than improving it. If we decide to add it to the default pipeline, then - at least to start - it should be opt-in for the targets (stages are not added unless the scheduling model for that subtarget explicitly asks for them).

About this patch:
the number of bytes fetched is not meaningful for the current "FetchStage".
The "FetchStage" doesn't/shouldn't care about how many bytes an instruction has. More importantly, our "intermediate form" is an already decoded instruction; the whole idea of checking how many bytes an instruction is at that stage is odd. We don't simulate the hardware frontend in the default pipeline (at least, not for now).
You need a separate stage for that. So, for now, sorry but I don't think it is a good compromise.

mattd added a subscriber: mattd. Oct 11 2018, 11:54 AM

The "FetchStage" is literally just there to create instructions and move them to the next stage.

It is just an artificial entrypoint stage that acts as a "data source" (where data is instructions). It doesn't try to model any hardware frontend concepts.

Its original name was "InstructionFormationStage". We ended up calling it "FetchStage"; in retrospect, it was not a good name because it causes ambiguity.
We may revert to that name if you like.

Yes, I think that would make sense.

Now that llvm-mca is a library, people can define their own custom pipeline without having to modify the "default pipeline stages".
In particular, I don't want to introduce any frontend concepts in the default pipeline of llvm-mca.
For now, any frontend simulation should be implemented by stages that are not part of the default pipeline.

I don't have a particular opinion about whether this should be part of the default pipeline or not, but I think modeling the frontend is very important.
This article from a couple of years ago analyzes typical workloads in a Google datacenter. While most of the stalls are from the backend, the frontend has a significant contribution:

"Front-end core stalls account for 15-30% of all pipeline slots, with many workloads showing 5-10% of cycles completely starved on instructions".

The authors found this trend to be increasing over time. The article shows that i-cache misses are a large part of these stalls, which is going to be hard to model statically.
However, we also found out that large computation kernels typically had frontend stalls in fetch&decode due to the large size of vector instructions (the Intel fetch window is 16 bytes). We had some nice wins based on llvm_sim, which does simulate the frontend. We've made two of these wins public
(https://github.com/webmproject/libwebp/commit/67748b41dbb21a43e88f2b6ddf6117f4338873a3, https://github.com/google/gemmlowp/pull/91).

I think we agreed during EuroLLVM last year that we should standardize on a single tool to avoid duplicating effort, and that this tool should be llvm-mca. That means that over time llvm-mca needs to grow to support more use cases.

The default pipeline should only stay focused on simulating the hardware backend logic.

I guess you meant to say: "the default pipeline should only stay focused on simulating what is modeled in the MCSchedModel" ? If we can carve out something that is common to all frontends, then it could end up in MCSchedModel, and then be in the default llvm-mca pipeline. (BTW the resource pointed to by Roman shows that the approach here might not be generic enough).
Until then it feels like the flag is a low-cost approach to implementing this.

We shouldn't make any changes to the current FetchStage.
Changes to other stages of the default pipeline are okay as long as we can demonstrate that those are beneficial for all the simulated processors.

I think that's fair. How would you feel about moving the change in this patch into a separate stage? The flag would then turn on adding the stage so that we can experiment with various CPUs. If that turns out to be useful, we can discuss adding fetch modeling to the MCSchedModel, which will then allow us to add this to the default pipeline in a principled way.
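
To make that concrete (a sketch with a made-up flag name; nothing below is an existing llvm-mca option), the experimental stage would only be appended on request, leaving the default pipeline untouched:

  // Hypothetical command-line switch; not an existing llvm-mca flag.
  static llvm::cl::opt<bool> ExperimentalFetchWindow(
      "experimental-fetch-window",
      llvm::cl::desc("Model a per-cycle instruction fetch window (experimental)"),
      llvm::cl::init(false));

  // ...and when assembling the pipeline:
  //   if (ExperimentalFetchWindow)
  //     Pipeline->appendStage(std::make_unique<FetchWindowStage>(16));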

Any other big change should require an RFC. The introduction of a new stage in the default pipeline should also be justified as it affects both simulation time, and potentially the quality of the analysis.

Essentially, I want to see an RFC on how your team wants to model the frontend simulation.
There are too many aspects that cannot be accurately modelled by a static analysis tool: branch prediction, loop buffers, different decoding paths, decoders with different capabilities, instruction byte windows, the instruction decoder's queue, etc.
If we want to do that as part of the default pipeline, then we have to be extremely careful and do it right.

If we don't describe the hardware frontend correctly, we risk pessimizing the analysis rather than improving it. If we decide to add it to the default pipeline, then - at least to start - it should be opt-in for the targets (stages are not added unless the scheduling model for that subtarget explicitly asks for them).

About this patch:
the number of bytes fetched is not meaningful for the current "FetchStage".
The "FetchStage" doesn't/shouldn't care about how many bytes an instruction has. More importantly, our "intermediate form" is an already decoded instruction; the whole idea of checking how many bytes an instruction is at that stage is odd. We don't simulate the hardware frontend in the default pipeline (at least, not for now).
You need a separate stage for that. So, for now, sorry but I don't think it is a good compromise.

Now that llvm-mca is a library, people can define their own custom pipeline without having to modify the "default pipeline stages".
In particular, I don't want to introduce any frontend concepts in the default pipeline of llvm-mca.
For now, any frontend simulation should be implemented by stages that are not part of the default pipeline.

I don't have a particular opinion about whether this should be part of the default pipeline or not, but I think modeling the frontend is very important.
This article from a couple of years ago analyzes typical workloads in a Google datacenter. While most of the stalls are from the backend, the frontend has a significant contribution:

I think that we are on the same page.
I have nothing against having frontend analysis: we want to be able to identify frontend bottlenecks.
My point was more about the "development process". I think we need to agree on a plan, and have a good roadmap. It is difficult to evaluate small incremental patches like this if we don't have a "vision". We should have at least an idea on what will be the next steps.

"Front-end core stalls account for 15-30% of all pipeline slots, with many workloads showing 5-10% of cycles completely starved on instructions".

The authors found this trend to be increasing over time. The article shows that i-cache misses are a large part of these stalls, which is going to be hard to model statically.
However, we also found out that large computation kernels typically had frontend stalls in fetch&decode due to the large size of vector instructions (the Intel fetch window is 16 bytes). We had some nice wins based on llvm_sim, which does simulate the frontend. We've made two of these wins public
(https://github.com/webmproject/libwebp/commit/67748b41dbb21a43e88f2b6ddf6117f4338873a3, https://github.com/google/gemmlowp/pull/91).

I think we agreed during EuroLLVM last year that we should standardize on a single tool to avoid duplicating effort, and that this tool should be llvm-mca. That means that over time llvm-mca needs to grow to support more use cases.

The default pipeline should only stay focused on simulating the hardware backend logic.

I guess you meant to say: "the default pipeline should only stay focused on simulating what is modeled in the MCSchedModel" ? If we can carve out something that is common to all frontends, then it could end up in MCSchedModel, and then be in the default llvm-mca pipeline. (BTW the resource pointed to by Roman shows that the approach here might not be generic enough).
Until then it feels like the flag is a low-cost approach to implementing this.

We shouldn't make any changes to the current FetchStage.
Changes to other stages of the default pipeline are okay as long as we can demonstrate that those are beneficial for all the simulated processors.

I think that's fair. How would you feel about moving the change in this patch into a separate stage? The flag would then turn on adding the stage so that we can experiment with various CPUs. If that turns out to be useful, we can discuss adding fetch modeling to the MCSchedModel, which will then allow us to add this to the default pipeline in a principled way.

I think that is the right way to go. It would unblock your work in the short term, and give us time to evaluate the quality of the new logic without affecting the default analysis pipeline.

This is pretty much what I was suggesting in my previous comment (i.e. have frontend logic/process defined by separate stages that run before "DispatchStage"). The current FetchStage will be renamed (I would do that after the conference if it is not a problem...), and it would still be the first stage to run. New stages would be marked as "experimental" to start, so that they are opt-in for subtargets.

P.s.: the new stage should have the concept of cache lines and alignment, so that we can experiment with different alignment constraints for the input code block. My understanding is that this new Fetch stage models the interaction with an IC (instruction cache); processor models should be able to customize what portion of a cache line can be picked every cycle.
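
As a toy illustration of why alignment matters (a stand-alone example, not llvm-mca code; the instruction sizes are made up): counting how many 16-byte aligned fetch windows a straight-line byte stream touches gives a lower bound on the fetch cycles it needs.

  #include <cstdio>
  #include <vector>

  // Number of WindowBytes-aligned fetch windows touched by the byte range
  // [StartOffset, StartOffset + sum(Sizes)).
  unsigned countFetchWindows(unsigned StartOffset,
                             const std::vector<unsigned> &Sizes,
                             unsigned WindowBytes = 16) {
    unsigned End = StartOffset;
    for (unsigned S : Sizes)
      End += S;
    return (End - 1) / WindowBytes - StartOffset / WindowBytes + 1;
  }

  int main() {
    std::vector<unsigned> Sizes = {6, 6, 6, 6}; // four 6-byte (e.g. AVX) instructions
    std::printf("16-byte aligned start: %u windows\n", countFetchWindows(0, Sizes));  // 2
    std::printf("start at offset 12:    %u windows\n", countFetchWindows(12, Sizes)); // 3
    return 0;
  }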

Note however that this new stage may not always be enabled if the processor implements a loop buffer.
For example, instructions may be picked from a loop buffer rather than going through the legacy decoders path (where instructions are fetched from the IC first). The throughput from the loop buffer normally differs from the throughput from the decoders, and it may be subject to different constraints (i.e. not the size in bytes of an instruction).
So, I am curious to see how you plan to model those frontend aspects. We may want to have two separate simulations: one where we always assume the IC path; another where we assume instructions are always picked from a loop buffer.
In practice, the choice of whether opcodes are contributed by the legacy decoders path or not depends on the feedback from the branch predictor and the size of the code snippet. So, the question (not for this patch) is: how much do we want to complicate the model? (That is why I was originally pushing for an RFC; I didn't mean to be annoying...) Should we care about modelling (at least a few) aspects of the branch predictor? We don't have to answer these questions now; I just wanted to further clarify why I feel cautious when it comes to modelling the frontend logic.

-Andrea

Now that llvm-mca is a library, people can define their own custom pipeline without having to modify the "default pipeline stages".
In particular, I don't want to introduce any frontend concepts in the default pipeline of llvm-mca.
For now, any frontend simulation should be implemented by stages that are not part of the default pipeline.

I don't have a particular opinion about whether this should be part of the default pipeline or not, but I think modeling the frontend is very important.
This article from a couple of years ago analyzes typical workloads in a Google datacenter. While most of the stalls are from the backend, the frontend has a significant contribution:

I think that we are on the same page.
I have nothing against having frontend analysis: we want to be able to identify frontend bottlenecks.
My point was more about the "development process". I think we need to agree on a plan, and have a good roadmap. It is difficult to evaluate small incremental patches like this if we don't have a "vision". We should have at least an idea on what will be the next steps.

"Front-end core stalls account for 15-30% of all pipeline slots, with many workloads showing 5-10% of cycles completely starved on instructions".

The authors found this trend to be increasing over time. The article shows that i-cache misses are a large part of these stalls, which is going to be hard to model statically.
However, we also found out that large computation kernels typically had frontend stalls in fetch&decode due to the large size of vector instructions (the Intel fetch window is 16 bytes). We had some nice wins based on llvm_sim, which does simulate the frontend. We've made two of these wins public
(https://github.com/webmproject/libwebp/commit/67748b41dbb21a43e88f2b6ddf6117f4338873a3, https://github.com/google/gemmlowp/pull/91).

I think we agreed during EuroLLVM last year that we should standardize on a single tool to avoid duplicating effort, and that this tool should be llvm-mca. That means that over time llvm-mca needs to grow to support more use cases.

The default pipeline should only stay focused on simulating the hardware backend logic.

I guess you meant to say: "the default pipeline should only stay focused on simulating what is modeled in the MCSchedModel" ? If we can carve out something that is common to all frontends, then it could end up in MCSchedModel, and then be in the default llvm-mca pipeline. (BTW the resource pointed to by Roman shows that the approach here might not be generic enough).
Until then it feels like the flag is a low-cost approach to implementing this.

We shouldn't make any changes to the current FetchStage.
Changes to other stages of the default pipeline are okay as long as we can demonstrate that those are beneficial for all the simulated processors.

I think that's fair. How would you feel about moving the change in this patch into a separate stage? The flag would then turn on adding the stage so that we can experiment with various CPUs. If that turns out to be useful, we can discuss adding fetch modeling to the MCSchedModel, which will then allow us to add this to the default pipeline in a principled way.

I think that is the right way to go. It would unblock your work in the short term, and give us time to evaluate the quality of the new logic without affecting the default analysis pipeline.

This is pretty much what I was suggesting in my previous comment (i.e. have frontend logic/process defined by separate stages that run before "DispatchStage"). The current FetchStage will be renamed (I would do that after the conference if it is not a problem...), and it would still be the first stage to run. New stages would be marked as "experimental" to start, so that they are opt-in for subtargets.

P.s.: the new stage should have the concept of cache lines and alignment, so that we can experiment with different alignment constraints for the input code block. My understanding is that this new Fetch stage models the interaction with an IC (instruction cache); processor models should be able to customize what portion of a cache line can be picked every cycle.

Note however that this new stage may not always be enabled if the processor implements a loop buffer.
For example, instructions may be picked from a loop buffer rather than going through the legacy decoders path (where instructions are fetched from the IC first). The throughput from the loop buffer normally differs from the throughput from the decoders, and it may be subject to different constraints (i.e. not the size in bytes of an instruction).

So, I am curious to see how you plan to model those frontend aspects. We may want to have two separate simulations: one where we always assume the IC path; another where we assume instructions are always picked from a loop buffer.
In practice, the choice of whether opcodes are contributed by the legacy decoders path or not depends on the feedback from the branch predictor and the size of the code snippet. So, the question (not for this patch) is: how much do we want to complicate the model? (That is why I was originally pushing for an RFC; I didn't mean to be annoying...) Should we care about modelling (at least a few) aspects of the branch predictor? We don't have to answer these questions now; I just wanted to further clarify why I feel cautious when it comes to modelling the frontend logic.

I fully agree. The approach we took in llvm_sim is to tell the simulator whether we're in a loop or not with a flag. In our implementation we always went through the legacy decoder because our goal was to improve scheduling of large blocks (that's where rescheduling really makes a difference). And that's something that we could generalize upon: we can build a pipeline depending on the structure of the input (an interesting read BTW: https://stackoverflow.com/questions/39311872/is-performance-reduced-when-executing-loops-whose-uop-count-is-not-a-multiple-of).
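
For what it's worth, the selection could stay very simple, e.g. (hypothetical class names, just to illustrate the idea of two distinct frontend models):

  // Pick a frontend model depending on how the analyzed block is assumed to
  // execute; LoopBufferStage and FetchWindowStage are hypothetical stages.
  std::unique_ptr<llvm::mca::Stage> makeFrontendStage(bool AssumeLoop) {
    if (AssumeLoop)
      return std::make_unique<LoopBufferStage>();  // uop-throughput limited
    return std::make_unique<FetchWindowStage>(16); // fetch-byte limited
  }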

RKSimon resigned from this revision. Mar 28 2021, 4:31 AM
Herald added a project: Restricted Project. Mar 28 2021, 4:31 AM