
[llvm-mca] Add support for in-order CPUs
ClosedPublic

Authored by asavonic on Jan 18 2021, 12:06 PM.

Details

Summary

This patch adds a simplified pipeline to support in-order CPUs such as
ARM Cortex-A55.

The in-order pipeline implements simplified versions of the Dispatch,
Scheduler and Execute stages as a single stage. The Entry and Retire
stages are shared by the in-order and out-of-order pipelines.

Diff Detail

Unit Tests: Failed

Time	Test
260 ms	x64 debian > LLVM.TableGen::InvalidMCSchedClassDesc.td
310 ms	x64 windows > LLVM.TableGen::InvalidMCSchedClassDesc.td

Event Timeline

asavonic created this revision.Jan 18 2021, 12:06 PM
asavonic requested review of this revision.Jan 18 2021, 12:06 PM
Herald added a project: Restricted Project.Jan 18 2021, 12:06 PM
asl added a subscriber: asl.Jan 18 2021, 12:10 PM
RKSimon added inline comments.Jan 19 2021, 3:29 AM
llvm/lib/MCA/CMakeLists.txt
21

sorting

Thanks for working on this feature!

I went through the patch and overall the approach seems sound.

There are, however, a few questions and concerns about the overall design.
That said, it shouldn't be difficult to address most of my comments.

Speaking about the design:

Your model assumes an unbounded queue of instructions (something like a rudimentary reservation station) in which to store dispatched instructions.

Correct me if I am wrong, but in-order processors don't use a reservation station.
In the absence of structural hazards, if data dependencies are met, then uOPs are directly issued to the underlying execution units.
So the dispatch event is not decoupled from the issue event.

The fact that your patch adds an unbounded queue sounds a bit strange to me. Not sure what @dmgreen
thinks about it. But this basically means that dispatch and issue are different events.

I also noticed how there are no checks on NumMicroOps. Is there a reason why you don't check for it?
In one of the tests, the target is dual-issue. However, there are cycles where three opcodes are dispatched.
See for example the test where two loads are dispatched in the same cycle (with the first load decoded into two uOPs).
Note also that the presence of '=' chars in the timeline represents cycles spent by an instruction while waiting in a scheduler to be issued.
To be honest, I was not expecting to see those characters at all.

Btw, there is also one particular case where two instructions seem to execute out of order.

You can find other comments below.

Thanks,
-Andrea

llvm/include/llvm/MCA/Stages/InOrderIssueStage.h
42–52

Please use unsigned quantities instead of signed integers.

llvm/lib/MCA/Context.cpp
31–33

I suggest moving the support for in-order pipeline composition into a separate function. I think it would help in terms of readability.

31–40

For readability reasons, I suggest splitting this method into two methods.

For example, something like if (!SM.isOutOfOrder()) createInOrderPipeline(opts, srcMgr);.

You then need to move all the pipeline composition logic for in-order processors inside createInOrderPipeline().

I think it would be much more readable (just my opinion though).
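A toy illustration of the suggested split might look like the following. All class and function names here are illustrative stand-ins for the actual llvm-mca code, which would dispatch on the scheduling model instead of a plain bool:

```cpp
#include <memory>
#include <string>

// Toy model of the refactoring suggested above: select the pipeline builder
// up front based on whether the scheduling model is out-of-order, instead of
// interleaving both compositions in one method. Names are illustrative only.
struct Pipeline {
  std::string Kind;
};

static std::unique_ptr<Pipeline> createInOrderPipeline() {
  return std::make_unique<Pipeline>(Pipeline{"in-order"});
}

static std::unique_ptr<Pipeline> createOutOfOrderPipeline() {
  return std::make_unique<Pipeline>(Pipeline{"out-of-order"});
}

std::unique_ptr<Pipeline> createDefaultPipeline(bool IsOutOfOrder) {
  // Early dispatch to the dedicated builder keeps each composition readable.
  if (!IsOutOfOrder)
    return createInOrderPipeline();
  return createOutOfOrderPipeline();
}
```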

llvm/lib/MCA/Stages/InOrderIssueStage.cpp
35–38

We also need to check if IR "begins a group".
Instructions that begin a group must always be the first instructions dispatched in a cycle. See how the isAvailable() check is implemented by the DispatchStage.

36

Shouldn't we also add checks on NumMicroOps somewhere?
One of the tests reports two loads dispatched within the same cycle. However, one of the loads is 2 uOps, so - correct me if I am wrong - it shouldn't be possible to dispatch two of those in the same cycle.

Out of curiosity:

ldr	w4, [x2], #4

is the resource consumption info correct for that instruction?
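Taken together, the group and micro-op constraints discussed in these comments could be sketched roughly as follows. This is a hypothetical simplification for illustration, not the actual llvm-mca implementation:

```cpp
// Hypothetical simplified model of the per-cycle dispatch checks discussed
// above; not the actual llvm-mca code.
struct InstrDesc {
  unsigned NumMicroOps;
  bool BeginGroup; // must be the first instruction dispatched in a cycle
  bool EndGroup;   // no further instructions may be dispatched this cycle
};

// UsedUOps: micro-ops already dispatched this cycle.
// IssueWidth: maximum micro-ops per cycle (e.g. 2 for a dual-issue core).
bool canDispatch(const InstrDesc &D, unsigned UsedUOps, unsigned IssueWidth) {
  if (D.BeginGroup && UsedUOps != 0)
    return false; // group-starting instructions must come first in the cycle
  // After dispatching an EndGroup instruction, the caller would zero the
  // remaining bandwidth for the cycle.
  return UsedUOps + D.NumMicroOps <= IssueWidth;
}
```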

50

As mentioned in another comment, we should use unsigned quantities whenever possible. So, int quantities that only describe zero or more cycles should really be converted to unsigned.

108–115

I expect register renaming to only occur for out-of-order processors. So move elimination shouldn't ever happen in this context.

That being said, instructions that are fully eliminated at register renaming stage still have their writes propagated to the register file. So, if you want to correctly handle that case, then you cannot early exit; you still need to add writes.

Note that the only instructions that can be eliminated at register renaming stage are register moves. See the definition and description of tablegen class RegisterFile in
https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Target/TargetSchedule.td

I don't expect any of this to really affect in-order targets. So, personally, I suggest replacing the initial if-stmt with an assert (i.e. check that IS is NOT eliminated).

143

Maybe add a comment here explaining why we are passing 0 as token-id.
Since this is an in-order processor, the retire stage (and therefore the token id) are not really required in practice. We don't expect the retire stage to do anything in particular.

149–151

Bandwidth should be set to zero if IR ends a group.

You probably need something semantically equivalent to what method DispatchStage::dispatch(InstRef IR) does:

// Check if this instruction ends the dispatch group.
if (Desc.EndGroup)
  AvailableEntries = 0;
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-all-views.s
115–127

Interesting.
According to this design, the "dispatch" event is still decoupled from the "issue" event. Is that expected?
I am asking because in my mind, at least for in-order processors, the two events should coincide. Basically there is no reservation station, and uOPs are directly issued to the underlying execution unit at a maximum rate of IssueWidth.

117–118

Why are these two executing out of order?

Thanks for the review Andrea!

Your model assumes an unbounded queue of instructions (something like a rudimentary reservation station) in which to store dispatched instructions.

If you mean InstQueue, then it is bounded by the Bandwidth variable - the
maximum number of instructions that can be issued in the next cycle.

Correct me if I am wrong, but in-order processors don't use a reservation station.
In the absence of structural hazards, if data dependencies are met, then uOPs are directly issued to the underlying execution units.
So the dispatch event is not decoupled from the issue event.

The fact that your patch adds an unbounded queue sounds a bit strange to me. Not sure what @dmgreen
thinks about it. But this basically means that dispatch and issue are different events.

That is true. However, the problem here is that the MCA timeline view counts stalls
as the number of cycles between the dispatch and issue events. If dispatch and issue
always happen in the same cycle, stalls are not displayed:

[0,3]     .  DeeER  .    .    add	w13, w30, #1
[0,4]     .  DeeeER .    .    smulh	x30, x29, x28
[0,5]     .     DeeeER   .    smulh	x27, x30, x28
[0,6]     .        DeeeER.    smulh	xzr, x27, x26
[0,7]     .    .    DeeeER    umulh	x30, x29, x28

To avoid this, the implementation emits a dispatch event for instructions that
should be executed in the next cycle. If an instruction is unable to execute due
to a hazard, it is delayed and a stall is displayed starting from the dispatch
event:

[0,3]     . DeeeER  .    .    add	w13, w30, #4095, lsl #12
[0,4]     . DeeeeER .    .    smulh	x30, x29, x28
[0,5]     .  D==eeeeER   .    smulh	x27, x30, x28
[0,6]     .  D=====eeeeER.    smulh	xzr, x27, x26
[0,7]     .    .  D=eeeeER    umulh	x30, x29, x28

I remember that I did this intentionally, but now I'm not really convinced that
this difference is worth the extra complexity. Let me know what you think about
this.

I also noticed how there are no checks on NumMicroOps. Is there a reason why you don't check for it?

Good point, I will fix that.

In one of the tests, the target is dual-issue. However, there are cycles where three opcodes are dispatched.
See for example the test where two loads are dispatched in the same cycle (with the first load decoded into two uOPs).

I think this should not happen. I will add a check for NumMicroOps.

llvm/test/tools/llvm-mca/AArch64/Cortex/A55-all-views.s
117–118

Madd and add are issued in the same cycle, subs is issued next.
However, they should not retire out-of-order. Some instructions can
retire out-of-order, but not these.

I have to look into this. Probably an RCU is actually needed for the
in-order pipeline.

Ok, I understand now.
It was done intentionally to emphasize issue stalls.

Personally, I'd prefer the 'dispatch' event to coincide with the 'issue' event.
From a theoretical point of view, those events should really be the same.

Conceptually, "dispatch" is only really useful when describing the lifetime of an instruction in an out-of-order processor.
For those processors, it makes sense to distinguish cycles where uOPs are picked from the decoder's queue from cycles where uOPs are actually sent by the scheduler(s) to the underlying pipeline(s).

In the future, we may introduce a new option for the timeline to toggle the generation of extra chars for dispatch stalls (possibly using a different character from '=' to avoid confusion).
That could be contributed as a separate patch. For now, if you don't mind, I'd rather keep your patch simple.

I also noticed how there are no checks on NumMicroOps. Is there a reason why you don't check for it?

Good point, I will fix that.

In one of the tests, the target is dual-issue. However, there are cycles where three opcodes are dispatched.
See for example the test where two loads are dispatched in the same cycle (with the first load decoded into two uOPs).

I think this should not happen. I will add a check for NumMicroOps.

Thanks!

llvm/test/tools/llvm-mca/AArch64/Cortex/A55-all-views.s
117–118

In theory, younger instructions should not be allowed to reach the write-back stage before older instructions because that would lead to out-of-order execution.
In this case I was expecting a compulsory stall to artificially delay the issue of the add so that it can write-back in-order w.r.t. the madd.
In which cases is it allowed to write back instructions out of order? Shouldn't architectural commits always happen in-order?

asavonic added inline comments.Jan 20 2021, 11:22 AM
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-all-views.s
117–118

In theory, younger instructions should not be allowed to reach the write-back stage before older instructions because that would lead to out-of-order execution.
In this case I was expecting a compulsory stall to artificially delay the issue of the add so that it can write-back in-order w.r.t. the madd.

I wonder how this works for instructions with early termination (sdiv, udiv).
@dmgreen, can you please comment on this?

What are those cases where it is allowed to write-back instructions out of order? Shouldn't architectural commits always happen in-order?

From Cortex-A55 optimization manual, s3.5.1 "Instructions with out-of-order completion":

While the Cortex-A55 core only issues instructions in-order, due to the number of cycles required to complete more complex floating-point and NEON instructions, out-of-order retire is allowed on the instructions described in this section. The nature of the Cortex-A55 microarchitecture is such that NEON and floating-point instructions of the same type have the same timing characteristics.

dmgreen added inline comments.Jan 20 2021, 11:29 AM
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-all-views.s
117–118

Yeah, I was about to quote the same thing!

The Cortex-R52, another in-order AArch32 CPU similar to the A53, mentions:
https://developer.arm.com/documentation/100026/0103/Cycle-Timings-and-Interlock-Behavior/Pipeline-behavior/Dual-issuing

If the Cortex-R52 processor determines that the pair must be dual-issued, it remains a pair until both instructions are retired.

So I believe it can depend on the CPU and possibly the instructions issued.

andreadb added inline comments.Jan 20 2021, 1:02 PM
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-all-views.s
117–118

I see.
That unfortunately complicates the whole implementation.
We do need a Retire Stage after all...

Is out-of-order execution limited in some ways in these processors?
For example: do these processors perform register renaming to break false dependencies?
If we don't need to worry about register renaming, then the implementation is simpler. I really hope that out-of-order execution is only allowed for completely independent instructions, and just to avoid bottlenecks caused by long-latency instructions...

By the way, do we have a mechanism in place to identify instructions that can be executed out-of-order? I don't remember ever seeing anything like that in the scheduling model.
If not, then we need a way to mark those instructions somehow. If so, we might be able to reuse the MCSchedPredicate framework for that (fingers crossed).

asavonic updated this revision to Diff 318858.Jan 24 2021, 12:40 PM
  • Added RCU support for the in-order pipeline. Some instructions should retire out-of-order, so a new flag was added to SchedClass.
  • Dispatch event is now emitted in the same cycle as an issue event.
  • Fixed code style issues.
asavonic added inline comments.Jan 24 2021, 1:10 PM
llvm/include/llvm/MCA/Stages/InOrderIssueStage.h
42–52

Done.

llvm/lib/MCA/CMakeLists.txt
21

Thanks, fixed.

llvm/lib/MCA/Context.cpp
31–33

Done.

31–40

Done.

llvm/lib/MCA/Stages/InOrderIssueStage.cpp
35–38

Thank you. I added a check for both BeginGroup and EndGroup.

36

Added a check for NumMicroOps.

Regarding the ldr: the optimization manual for Cortex-A55 states that throughput for most load instructions is 1, so I guess it is correct.

50

Done.

108–115

Thank you for the explanation! Replaced the if statement with an assert.

143

The code is changed a bit to accommodate an RCU. Let me know if it still needs a comment.

149–151

Done.

llvm/test/tools/llvm-mca/AArch64/Cortex/A55-all-views.s
115–127

Fixed. Now a dispatch event and an issue event occur in the same cycle.

117–118

I changed the RetireStage, RCU and the scheduler model to support this.
RetireControlUnit is now used if it is defined in the scheduler model. I ran a couple of tests on hardware, and MCA results seem to match it pretty well.

Some instructions require out-of-order retire. I added a RetireOOO flag to MCSchedClass to handle this feature. Instructions with this flag do not block other instructions and may retire as soon as they complete execution.
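The retire behavior described here can be sketched as a small self-contained toy model: completed instructions retire in program order, except those flagged RetireOOO, which may retire as soon as they finish without blocking on older instructions. The struct and function below are illustrative, not the actual llvm-mca RCU implementation:

```cpp
#include <deque>
#include <vector>

// Toy model of the retire logic discussed above (illustrative names only).
struct RetireEntry {
  unsigned Id;    // program order position
  bool Completed; // finished execution
  bool RetireOOO; // allowed to retire out-of-order (e.g. long FP/NEON ops)
};

std::vector<unsigned> retireCompleted(std::deque<RetireEntry> &Queue) {
  std::vector<unsigned> Retired;
  // First retire the completed in-order head of the queue.
  while (!Queue.empty() && Queue.front().Completed) {
    Retired.push_back(Queue.front().Id);
    Queue.pop_front();
  }
  // Then retire any completed RetireOOO entries, skipping blocked ones.
  for (auto It = Queue.begin(); It != Queue.end();) {
    if (It->Completed && It->RetireOOO) {
      Retired.push_back(It->Id);
      It = Queue.erase(It);
    } else {
      ++It;
    }
  }
  return Retired;
}
```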

asavonic updated this revision to Diff 318943.Jan 25 2021, 3:44 AM

Fixed LIT InvalidMCSchedClassDesc.td.

Thanks for the updated patch.

I left a couple of minor comments below. Otherwise the design looks good to me.

@dmgreen @evgeny777, what is your opinion on this simulation pipeline? Do you think that this design might work well for other ARM in-order processors, or are there other things that should be improved? Also, if you could try a few experiments with this patch, that would be great.

Thanks,
-Andrea

llvm/include/llvm/Target/TargetSchedule.td
265–266

Note that instructions can have multiple writes (typically one per each register definition). So the comment should say that a 'write' is allowed to retire out-of-order.

Also, this field is only used by llvm-mca. The value of the field is ignored if the model doesn't describe an in-order processor.

llvm/utils/TableGen/SubtargetEmitter.cpp
1105

Basically, an instruction is allowed to retire out-of-order if RetireOOO is true for at least one of its writes.
This field is only meaningful for in-order subtargets, and is ignored for other targets.

I think that this should be better described in a comment (in TargetPredicates.td).

RKSimon added inline comments.Jan 27 2021, 10:35 AM
llvm/include/llvm/MC/MCSchedule.h
120

This is a separate problem, but Windows builds currently take 4 bytes for this (before and after this patch) as bool is treated as an int. Would it be possible to change this to:

uint16_t NumMicroOps : 13;
uint16_t BeginGroup : 1;
uint16_t EndGroup : 1;
uint16_t RetireOOO : 1;

Ideally replace the bools in an initial patch before this one?

asavonic updated this revision to Diff 321087.Feb 3 2021, 6:42 AM

Updated the comment for RetireOOO.

asavonic added inline comments.Feb 3 2021, 7:26 AM
llvm/include/llvm/MC/MCSchedule.h
120

Thanks. Uploaded a fix to https://reviews.llvm.org/D95954.

andreadb added a comment.EditedFeb 17 2021, 6:12 AM

Now that https://reviews.llvm.org/D95954 has been fixed, there are only a few things left to do:

  1. Disable the bottleneck analysis for in-order processors.
  2. Add a couple of lines to the release notes describing this new feature.
  3. Update the llvm-mca docs.

About the bottleneck analysis view:
The current implementation works under the assumption that the processor is out-of-order. The analysis internally observes events generated by the scheduler, as well as instruction state transitions. It correlates those events with scheduler queue availability to track pressure increases due to data dependencies etc. For in-order processors, we need completely different logic; the existing logic would not work (as you can tell from one of the tests you added, for which no bottlenecks were reported for data dependencies!).

We should raise a warning from the llvm-mca driver if the user explicitly requested the bottleneck analysis, and the processor is in-order. The warning should explain how the bottleneck analysis is currently unsupported for in-order processors. Note also that flag -all-views should not enable the bottleneck analysis if the simulated processor is in-order.

About the documentation: in a few paragraphs we mention "out-of-order" backends. In most cases, it is just a matter of removing the word "out-of-order", and the rest of the paragraph is still valid.
However, section "Instruction Flow" requires an extra paragraph for in-order processors. The instruction flow is much simpler for those processors, and the dispatch event is not decoupled from the issue event (it is all handled by your new "InOrderIssue" stage).
Please add a new section for your new "InOrderIssue" stage (ideally, it should go after the "Write-Back and Retire Stage" section). There you should briefly mention what happens to instructions during that stage. Please also add a small paragraph explaining how the current model allows for out-of-order execution of writes that are specially marked in the scheduling model.

I think this is all.
I plan to accept the patch once those three points are addressed.

Thanks,
-Andrea

Thanks for working on this, it looks like it will be very useful.

llvm/lib/MCA/Stages/InOrderIssueStage.cpp
82

How come this asserts on ReadAdvance < 0? I thought it was relatively common to have certain instructions requiring operands before the main stage the pipeline is based on.

andreadb added inline comments.Feb 19 2021, 3:10 AM
llvm/lib/MCA/Stages/InOrderIssueStage.cpp
82

I agree with Dave here. This logic must also work for the case where ReadAdvance is a negative value. There is no reason to restrict this logic to cases where ReadAdvance is >= 0.

You also don't need the assert for UNKNOWN_CYCLES at line 82. If the number of "write cycles left" for a register is unknown, then you can safely ignore it, and leave StallCycles as is (i.e. simply continue to the next iteration of the loop).

If however the number of write cycles left is different than UNKNOWN_CYCLES, then you should always update StallCycles with the std::max between the actual StallCycles and CyclesLeft - ReadAdvance. A negative ReadAdvance would effectively "increase" the CyclesLeft (which is what we want to model here).

For this to work, CyclesLeft must be manipulated as a signed quantity (note that UNKNOWN_CYCLES is also a signed quantity; its value is -1). std::max requires for operands to be of the same type, so this may require an extra cast for StallCycles.
You should then get rid of the check at line 84, and always compute the std::max, to update StallCycles.
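A minimal self-contained sketch of the computation Andrea describes, with CyclesLeft as a signed quantity so a negative ReadAdvance effectively increases it. The function name and signature are illustrative, not the actual llvm-mca code; in practice the caller would fold this over all register reads of the instruction:

```cpp
#include <algorithm>

constexpr int UNKNOWN_CYCLES = -1; // sentinel described above

// Update the stall count for one register read. Writes with an unknown
// number of cycles left are optimistically ignored, as suggested above.
unsigned updateStallCycles(unsigned StallCycles, int CyclesLeft,
                           int ReadAdvance) {
  if (CyclesLeft == UNKNOWN_CYCLES)
    return StallCycles;
  // A negative ReadAdvance increases the effective number of cycles left.
  int Needed = CyclesLeft - ReadAdvance;
  if (Needed < 0)
    Needed = 0; // the operand is already (or will be) available in time
  return std::max<int>(StallCycles, Needed);
}
```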

asavonic updated this revision to Diff 327409.Mar 2 2021, 4:37 AM

Supported UNKNOWN_CYCLES left and negative ReadAdvance.
Disabled bottleneck analysis for in-order CPUs.
Updated documentation.

Now that https://reviews.llvm.org/D95954 has been fixed, there are only a few things left to do:

  1. Disable the bottleneck analysis for in-order processors.
  2. Add a couple of lines to the release notes describing this new feature.
  3. Update the llvm-mca docs.

Thank you! Can you please check if wording in the docs is ok?

Regarding the bottleneck analysis: it seems to work for some cases
(A55-all-views.s), but not for others (A55-out-of-order-retire.s).
I will check what is wrong here. The feature is disabled for now as you
suggested.

llvm/lib/MCA/Stages/InOrderIssueStage.cpp
82

How come this asserts on ReadAdvance < 0? I thought it was relatively common to have certain instructions requiring operands before the main stage the pipeline is based on.

I wasn't sure that it was a valid case. Thank you for the explanation. I've
adjusted the code to handle negative ReadAdvance. There seem to be no such cases for the A55, so I added a test for the M7.

82

You also don't need the assert for UNKNOWN_CYCLES at line 82. If the number of "write cycles left" for a register is unknown, then you can safely ignore it, and leave StallCycles as is (i.e. simply continue to the next iteration of the loop).

I'm not sure that we can just ignore it. If it is unknown, then it is not zero,
so a register hazard should be detected. I've modified the code to return a 1-cycle
stall, so we can check that the value is "known" in the next cycle.

In any case, this should never happen, because Instruction::execute is called
before a subsequent instruction issue, and it always sets CyclesLeft to a
known value.

If however the number of write cycles left is different than UNKNOWN_CYCLES, then you should always update StallCycles with the std::max between the actual StallCycles and CyclesLeft - ReadAdvance. A negative ReadAdvance would effectively "increase" the CyclesLeft (which is what we want to model here).

For this to work, CyclesLeft must be manipulated as a signed quantity (note that UNKNOWN_CYCLES is also a signed quantity; its value is -1). std::max requires for operands to be of the same type, so this may require an extra cast for StallCycles.

Right, this is fixed now.

You should then get rid of the check at line 84, and always compute the std::max, to update StallCycles.

The check is still needed to ensure that we don't cast a negative value to
unsigned.

Thanks Andrew for the updated patch!

The doc changes look good to me.

The code changes also look good, except for one thing (I left a comment below).

About the bottleneck analysis:
The current analysis assumes an out-of-order pipeline. So I wouldn't be surprised if there are cases where it doesn't work well.
As you might already know, the bottleneck analysis effectively observes pressure events which are dynamically generated by a Scheduler component. Based on the observed trace, it predicts whether throughput was limited by a lack of processor resources and/or data dependencies.
The good thing about your patch is that your code already generates pressure events (from your new canExecute() method). A bottleneck analysis for in-order processors would be much simpler, since we don't need any complex analysis based on the knowledge of which instructions are in a PENDING/READY state in a scheduler buffer.

The best thing to do for now is to raise an llvm-mca bug about implementing a bottleneck analysis for in-order processors. You can raise it now, or wait until this patch is finally committed (either way is fine).

llvm/lib/MCA/Stages/InOrderIssueStage.cpp
84

As I wrote before, you should simply ignore the case where the number of cycles left is equal to UNKNOWN_CYCLES.
You shouldn't early exit with an arbitrary number of cycles.

Basically what I am saying is that you should replace this return statement with a continue;

andreadb added inline comments.Mar 2 2021, 5:48 AM
llvm/lib/MCA/Stages/InOrderIssueStage.cpp
82

Sorry, I have only read your comment now.

It is true that, strictly speaking, UNKNOWN_CYCLES isn't equivalent to zero cycles.
However, in practice we shouldn't have to deal with writes of UNKNOWN_CYCLES.
If I remember correctly, the out-of-order simulator optimistically ignores them.

If you really want to enforce at least a 1-cycle delay for it, then fair enough. However you shouldn't return immediately. Instead, you should simply update the stall cycles quantity and then continue to the next iteration of the loop.

asavonic added inline comments.Mar 2 2021, 6:04 AM
llvm/lib/MCA/Stages/InOrderIssueStage.cpp
84

As I wrote before, you should simply ignore the case where the number of cycles left is equal to UNKNOWN_CYCLES.
You shouldn't early exit with an arbitrary number of cycles.

Basically what I am saying is that you should replace this return statement with a continue;

I mentioned this above, but I might be missing something.

What exactly does UNKNOWN_CYCLES mean in this case? I assume that for an in-order
pipeline it means that an instruction is issued (since we issue in-order), but
we don't know when its write is going to be completed. If this is correct, then
we should not issue any instructions that depend on this write until we know that the write
is completed (and has CyclesLeft == 0).

return 1 here stalls the instruction for 1 cycle, and it will be checked again
in the next cycle. If we continue here instead, then the write is ignored.

asavonic updated this revision to Diff 327440.Mar 2 2021, 6:46 AM

Update StallCycles and continue instead of returning.

andreadb accepted this revision.Mar 2 2021, 7:30 AM

LGTM.

llvm/lib/MCA/Stages/InOrderIssueStage.cpp
84

OK, I did some archaeology (it has been a while since I wrote that code).

Long story short:
When simulating in-order processors, you should never end up in a situation where the number of cycles left is UNKNOWN_CYCLES. Your code was not wrong. Adding a continue is also OK (but not necessarily better). Ideally, you could just assert that the number of cycles is different than UNKNOWN_CYCLES. Up to you.

Long story:
Each register write is associated with a CyclesLeft quantity. The number of cycles left is unknown until the instruction is issued. Only at that point, we know exactly how many cycles are left before writes are committed to the register file.

On instruction issue, dependent instructions get notified, so that the CyclesLeft quantities associated with their register reads are also updated.

This is important when simulating an out-of-order backend, because the scheduler may contain a chain of dependent instructions, and not all the instructions of that chain are in a "ready" state.

The relevant code is in Instruction.cpp (see for example method WriteState::onInstructionIssued).

Back to our case:

For in-order processors, instructions are always issued in-order, and instructions are not buffered internally by a scheduler. So the "dispatch" and "issue" are basically the same event.

In the presence of data dependencies, instructions are artificially delayed until all inputs become available (i.e. inputs have a known number of cycles left). Only at that point, instructions are issued to the underlying execution units.

By construction, you cannot end up in a situation where the "current instruction" depends on an older instruction which hasn't started execution yet.

To put it another way: by construction, it is always guaranteed that older instructions have already started execution. Therefore, the CyclesLeft value for each write is always known.

This revision is now accepted and ready to land.Mar 2 2021, 7:30 AM
asavonic added inline comments.Mar 2 2021, 9:19 AM
llvm/lib/MCA/Stages/InOrderIssueStage.cpp
84

Thanks for the confirmation. This matches my analysis as well.
Instruction::execute always updates CyclesLeft, so we never have
UNKNOWN_CYCLES here. Let's keep the handling in case this changes at some point.

andreadb added inline comments.Mar 2 2021, 11:34 AM
llvm/lib/MCA/Stages/InOrderIssueStage.cpp
84

Sounds good.

Thanks for working on this feature!

-Andrea

This revision was landed with ongoing or failed builds.Mar 4 2021, 3:12 AM
This revision was automatically updated to reflect the committed changes.
foad added a subscriber: foad.Mar 9 2021, 4:05 AM