
[llvm-mca] Add support for in-order CPUs
ClosedPublic

Authored by asavonic on Jan 18 2021, 12:06 PM.

Details

Summary

This patch adds a simplified pipeline to support in-order CPUs such as
ARM Cortex-A55.

The in-order pipeline implements simplified versions of the Dispatch,
Scheduler and Execute stages as a single stage. The Entry and Retire
stages are shared by the in-order and out-of-order pipelines.

Diff Detail

Unit Tests: Failed

Time	Test
260 ms	x64 debian > LLVM.TableGen::InvalidMCSchedClassDesc.td
310 ms	x64 windows > LLVM.TableGen::InvalidMCSchedClassDesc.td

Event Timeline

asavonic created this revision.Jan 18 2021, 12:06 PM
asavonic requested review of this revision.Jan 18 2021, 12:06 PM
Herald added a project: Restricted Project.Jan 18 2021, 12:06 PM
asl added a subscriber: asl.Jan 18 2021, 12:10 PM
RKSimon added inline comments.Jan 19 2021, 3:29 AM
llvm/lib/MCA/CMakeLists.txt
21

sorting

Thanks for working on this feature!

I went through the patch and overall the approach seems sound.

There are, however, a few questions and concerns about the overall design.
That said, it shouldn't be difficult to address most of my comments.

Speaking about the design:

Your model assumes an unbounded queue of instructions (something like a rudimentary reservation station) in which to store dispatched instructions.

Correct me if I am wrong, but in-order processors don't use a reservation station.
In the absence of structural hazards, if data dependencies are met, then uOPs are directly issued to the underlying execution units.
So the dispatch event is not decoupled from the issue event.

The fact that your patch adds an unbounded queue sounds a bit strange to me. Not sure what @dmgreen
thinks about it. But this basically means that dispatch and issue are different events.

I also noticed how there are no checks on NumMicroOps. Is there a reason why you don't check for it?
In one of the tests, the target is dual-issue. However, there are cycles where three opcodes are dispatched.
See for example the test where two loads are dispatched in the same cycle (with the first load decoded into two uOPs).
Note also that the presence of '=' chars in the timeline represents cycles spent by an instruction while waiting in a scheduler to be issued.
To be honest, I was not expecting to see those characters at all.

Btw, there is also one particular case where two instructions seem to execute out of order.

You can find other comments below.

Thanks,
-Andrea

llvm/include/llvm/MCA/Stages/InOrderIssueStage.h
42–52

Please use unsigned quantities instead of signed integers.

llvm/lib/MCA/Context.cpp
31–33

I suggest moving the support for in-order pipeline composition into a separate function. I think it would help in terms of readability.

31–40

For readability reasons, I suggest splitting this method into two methods.

For example, something like if (!SM.isOutOfOrder()) createInOrderPipeline(opts, srcMgr);.

You then need to move all the pipeline composition logic for in-order processors inside createInOrderPipeline().

I think it would be much more readable (just my opinion though).
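A toy illustration of the suggested split might look like the following. All class and function names here are illustrative stand-ins for the actual llvm-mca code, which would dispatch on the scheduling model instead of a plain bool:

```cpp
#include <memory>
#include <string>

// Toy model of the refactoring suggested above: select the pipeline builder
// up front based on whether the scheduling model is out-of-order, instead of
// interleaving both compositions in one method. Names are illustrative only.
struct Pipeline {
  std::string Kind;
};

static std::unique_ptr<Pipeline> createInOrderPipeline() {
  return std::make_unique<Pipeline>(Pipeline{"in-order"});
}

static std::unique_ptr<Pipeline> createOutOfOrderPipeline() {
  return std::make_unique<Pipeline>(Pipeline{"out-of-order"});
}

std::unique_ptr<Pipeline> createDefaultPipeline(bool IsOutOfOrder) {
  // Early dispatch to the dedicated builder keeps each composition readable.
  if (!IsOutOfOrder)
    return createInOrderPipeline();
  return createOutOfOrderPipeline();
}
```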

llvm/lib/MCA/Stages/InOrderIssueStage.cpp
35–38

We also need to check if IR "begins a group".
Instructions that begin a group must always be the first instructions dispatched in a cycle. See how the isAvailable() check is implemented by the DispatchStage.

36

Shouldn't we also add checks on NumMicroOps somewhere?
One of the tests reports two loads dispatched within the same cycle. However, one of the loads is 2 uOps, so - correct me if I am wrong - it shouldn't be possible to dispatch two of those in the same cycle.

Out of curiosity:

ldr	w4, [x2], #4

is the resource consumption info correct for that instruction?
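Taken together, the group and micro-op constraints discussed in these comments could be sketched roughly as follows. This is a hypothetical simplification for illustration, not the actual llvm-mca implementation:

```cpp
// Hypothetical simplified model of the per-cycle dispatch checks discussed
// above; not the actual llvm-mca code.
struct InstrDesc {
  unsigned NumMicroOps;
  bool BeginGroup; // must be the first instruction dispatched in a cycle
  bool EndGroup;   // no further instructions may be dispatched this cycle
};

// UsedUOps: micro-ops already dispatched this cycle.
// IssueWidth: maximum micro-ops per cycle (e.g. 2 for a dual-issue core).
bool canDispatch(const InstrDesc &D, unsigned UsedUOps, unsigned IssueWidth) {
  if (D.BeginGroup && UsedUOps != 0)
    return false; // group-starting instructions must come first in the cycle
  // After dispatching an EndGroup instruction, the caller would zero the
  // remaining bandwidth for the cycle.
  return UsedUOps + D.NumMicroOps <= IssueWidth;
}
```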

50

As mentioned in another comment, we should use unsigned quantities whenever possible. So, int quantities that only describe zero or more cycles should really be converted to unsigned.

108–115

I expect register renaming to only occur for out-of-order processors. So move elimination shouldn't ever happen in this context.

That being said, instructions that are fully eliminated at register renaming stage still have their writes propagated to the register file. So, if you want to correctly handle that case, then you cannot early exit; you still need to add writes.

Note that the only instructions that can be eliminated at register renaming stage are register moves. See the definition and description of tablegen class RegisterFile in
https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Target/TargetSchedule.td

I don't expect any of this to really affect in-order targets. So, personally, I suggest replacing the initial if-stmt with an assert (i.e. check that IS is NOT eliminated).

143

Maybe add a comment here explaining why we are passing 0 as token-id.
Since this is an in-order processor, the retire stage (and therefore the token id) are not really required in practice. We don't expect the retire stage to do anything in particular.

149–151

Bandwidth should be set to zero if IR ends a group.

You probably need something semantically equivalent to what method DispatchStage::dispatch(InstRef IR) does:

// Check if this instruction ends the dispatch group.
if (Desc.EndGroup)
  AvailableEntries = 0;
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-all-views.s
115–127

Interesting.
According to this design, the "dispatch" event is still decoupled from the "issue" event. Is that expected?
I am asking because in my mind, at least for in-order processors, the two events should coincide. Basically there is no reservation station, and uOPs are directly issued to the underlying execution unit at a maximum rate of IssueWidth.

117–118

Why are these two executing out of order?

Thanks for the review Andrea!

Your model assumes an unbounded queue of instructions (something like a rudimentary reservation station) in which to store dispatched instructions.

If you mean InstQueue, then it is bounded by the Bandwidth variable - the
maximum number of instructions that can be issued in the next cycle.

Correct me if I am wrong, but in-order processors don't use a reservation station.
In the absence of structural hazards, if data dependencies are met, then uOPs are directly issued to the underlying execution units.
So the dispatch event is not decoupled from the issue event.

The fact that your patch adds an unbounded queue sounds a bit strange to me. Not sure what @dmgreen
thinks about it. But this basically means that dispatch and issue are different events.

That is true. However, the problem here is that the MCA timeline view counts stalls
as the number of cycles between the dispatch and issue events. If dispatch and issue
always happen in the same cycle, stalls are not displayed:

[0,3]     .  DeeER  .    .    add	w13, w30, #1
[0,4]     .  DeeeER .    .    smulh	x30, x29, x28
[0,5]     .     DeeeER   .    smulh	x27, x30, x28
[0,6]     .        DeeeER.    smulh	xzr, x27, x26
[0,7]     .    .    DeeeER    umulh	x30, x29, x28

To avoid this, the implementation emits a dispatch event for instructions that
should be executed in the next cycle. If an instruction is unable to execute due
to a hazard, it is delayed and a stall is displayed starting from the dispatch
event:

[0,3]     . DeeeER  .    .    add	w13, w30, #4095, lsl #12
[0,4]     . DeeeeER .    .    smulh	x30, x29, x28
[0,5]     .  D==eeeeER   .    smulh	x27, x30, x28
[0,6]     .  D=====eeeeER.    smulh	xzr, x27, x26
[0,7]     .    .  D=eeeeER    umulh	x30, x29, x28

I remember that I did this intentionally, but now I'm not really convinced that
this difference is worth the extra complexity. Let me know what you think about
this.

I also noticed how there are no checks on NumMicroOps. Is there a reason why you don't check for it?

Good point, I will fix that.

In one of the tests, the target is dual-issue. However, there are cycles where three opcodes are dispatched.
See for example the test where two loads are dispatched in the same cycle (with the first load decoded into two uOPs).

I think this should not happen. I will add a check for NumMicroOps.

llvm/test/tools/llvm-mca/AArch64/Cortex/A55-all-views.s
117–118

Madd and add are issued in the same cycle, subs is issued next.
However, they should not retire out-of-order. Some instructions can
retire out-of-order, but not these.

I have to look into this. Probably an RCU is actually needed for the
in-order pipeline.

Ok, I understand now.
It was done intentionally to emphasize issue stalls.

Personally, I'd prefer the 'dispatch' event to coincide with the 'issue' event.
From a theoretical point of view, those events should really be the same.

Conceptually, "dispatch" is only really useful when describing the lifetime of an instruction in an out-of-order processor.
For those processors, it makes sense to distinguish cycles where uOPs are picked from the decoder's queue from cycles where uOPs are actually sent by the scheduler(s) to the underlying pipeline(s).

In the future, we may introduce a new option for the timeline to toggle the generation of extra chars for dispatch stalls (possibly using a different character from '=' to avoid confusion).
That could be contributed as a separate patch. For now, if you don't mind, I'd rather keep your patch simple.

I also noticed how there are no checks on NumMicroOps. Is there a reason why you don't check for it?

Good point, I will fix that.

In one of the tests, the target is dual-issue. However, there are cycles where three opcodes are dispatched.
See for example the test where two loads are dispatched in the same cycle (with the first load decoded into two uOPs).

I think this should not happen. I will add a check for NumMicroOps.

Thanks!

llvm/test/tools/llvm-mca/AArch64/Cortex/A55-all-views.s
117–118

In theory, younger instructions should not be allowed to reach the write-back stage before older instructions because that would lead to out-of-order execution.
In this case I was expecting a compulsory stall to artificially delay the issue of the add so that it can write-back in-order w.r.t. the madd.
In which cases is it allowed to write back instructions out of order? Shouldn't architectural commits always happen in-order?

asavonic added inline comments.Jan 20 2021, 11:22 AM
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-all-views.s
117–118

In theory, younger instructions should not be allowed to reach the write-back stage before older instructions because that would lead to out-of-order execution.
In this case I was expecting a compulsory stall to artificially delay the issue of the add so that it can write-back in-order w.r.t. the madd.

I wonder how this works for instructions with early termination (sdiv, udiv).
@dmgreen, can you please comment on this?

What are those cases where it is allowed to write-back instructions out of order? Shouldn't architectural commits always happen in-order?

From Cortex-A55 optimization manual, s3.5.1 "Instructions with out-of-order completion":

While the Cortex-A55 core only issues instructions in-order, due to the number of cycles required to complete more complex floating-point and NEON instructions, out-of-order retire is allowed on the instructions described in this section. The nature of the Cortex-A55 microarchitecture is such that NEON and floating-point instructions of the same type have the same timing characteristics.

dmgreen added inline comments.Jan 20 2021, 11:29 AM
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-all-views.s
117–118

Yeah, I was about to quote the same thing!

The Cortex-R52, another in-order AArch32 CPU similar to the A53, mentions:
https://developer.arm.com/documentation/100026/0103/Cycle-Timings-and-Interlock-Behavior/Pipeline-behavior/Dual-issuing

If the Cortex-R52 processor determines that the pair must be dual-issued, it remains a pair until both instructions are retired.

So I believe it can depend on the CPU and possibly the instructions issued.

andreadb added inline comments.Jan 20 2021, 1:02 PM
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-all-views.s
117–118

I see.
That unfortunately complicates the whole implementation.
We do need a Retire Stage after all...

Is out-of-order execution limited in some ways in these processors?
For example: do these processors perform register renaming to break false dependencies?
If we don't need to worry about register renaming, then the implementation is simpler. I really hope that out-of-order execution is only allowed for completely independent instructions, and just to avoid bottlenecks caused by long-latency instructions...

By the way, do we have a mechanism in place to identify instructions that can be executed out-of-order? I don't remember ever seeing anything like that in the scheduling model.
If not, then we need a way to mark those instructions somehow. If so, we might be able to reuse the MCSchedPredicate framework for that (fingers crossed).

asavonic updated this revision to Diff 318858.Jan 24 2021, 12:40 PM
  • Added RCU support for the in-order pipeline. Some instructions should retire out-of-order, so a new flag was added to SchedClass.
  • Dispatch event is now emitted in the same cycle as an issue event.
  • Fixed code style issues.
asavonic added inline comments.Jan 24 2021, 1:10 PM
llvm/include/llvm/MCA/Stages/InOrderIssueStage.h
42–52

Done.

llvm/lib/MCA/CMakeLists.txt
21

Thanks, fixed.

llvm/lib/MCA/Context.cpp
31–33

Done.

31–40

Done.

llvm/lib/MCA/Stages/InOrderIssueStage.cpp
35–38

Thank you. I added a check for both BeginGroup and EndGroup.

36

Added a check for NumMicroOps.

Regarding the ldr: the optimization manual for Cortex-A55 states that throughput for most load instructions is 1, so I guess it is correct.

50

Done.

108–115

Thank you for the explanation! Replaced the if statement with an assert.

143

The code is changed a bit to accommodate an RCU. Let me know if it still needs a comment.

149–151

Done.

llvm/test/tools/llvm-mca/AArch64/Cortex/A55-all-views.s
115–127

Fixed. Now a dispatch event and an issue event occur in the same cycle.

117–118

I changed the RetireStage, RCU and the scheduler model to support this.
RetireControlUnit is now used if it is defined in the scheduler model. I ran a couple of tests on hardware, and MCA results seem to match it pretty well.

Some instructions require out-of-order retire. I added a RetireOOO flag to MCSchedClass to handle this feature. Instructions with this flag do not block other instructions and may retire as soon as they complete execution.
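The retire behavior described here can be sketched as a small self-contained toy model: completed instructions retire in program order, except those flagged RetireOOO, which may retire as soon as they finish without blocking on older instructions. The struct and function below are illustrative, not the actual llvm-mca RCU implementation:

```cpp
#include <deque>
#include <vector>

// Toy model of the retire logic discussed above (illustrative names only).
struct RetireEntry {
  unsigned Id;    // program order position
  bool Completed; // finished execution
  bool RetireOOO; // allowed to retire out-of-order (e.g. long FP/NEON ops)
};

std::vector<unsigned> retireCompleted(std::deque<RetireEntry> &Queue) {
  std::vector<unsigned> Retired;
  // First retire the completed in-order head of the queue.
  while (!Queue.empty() && Queue.front().Completed) {
    Retired.push_back(Queue.front().Id);
    Queue.pop_front();
  }
  // Then retire any completed RetireOOO entries, skipping blocked ones.
  for (auto It = Queue.begin(); It != Queue.end();) {
    if (It->Completed && It->RetireOOO) {
      Retired.push_back(It->Id);
      It = Queue.erase(It);
    } else {
      ++It;
    }
  }
  return Retired;
}
```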

asavonic updated this revision to Diff 318943.Jan 25 2021, 3:44 AM

Fixed LIT InvalidMCSchedClassDesc.td.

Thanks for the updated patch.

I left a couple of minor comments below. Otherwise the design looks good to me.

@dmgreen @evgeny777, what is your opinion on this simulation pipeline? Do you think that this design might work well for other ARM in-order processors, or are there other things that should be improved? Also, if you could try a few experiments with this patch, that would be great.

Thanks,
-Andrea

llvm/include/llvm/Target/TargetSchedule.td
265–266

Note that instructions can have multiple writes (typically one per each register definition). So the comment should say that a 'write' is allowed to retire out-of-order.

Also, this field is only used by llvm-mca. The value of the field is ignored if the model doesn't describe an in-order processor.

llvm/utils/TableGen/SubtargetEmitter.cpp
1105

Basically, an instruction is allowed to retire out-of-order if RetireOOO is true for at least one of its writes.
This field is only meaningful for in-order subtargets, and is ignored for other targets.

I think that this should be better described in a comment (in TargetPredicates.td).

RKSimon added inline comments.Jan 27 2021, 10:35 AM
llvm/include/llvm/MC/MCSchedule.h
120

This is a separate problem, but Windows builds currently take 4 bytes for this (before and after this patch) as bool is treated as an int. Would it be possible to change this to:

uint16_t NumMicroOps : 13;
uint16_t BeginGroup : 1;
uint16_t EndGroup : 1;
uint16_t RetireOOO : 1;

Ideally replace the bools in an initial patch before this one?

asavonic updated this revision to Diff 321087.Feb 3 2021, 6:42 AM

Updated the comment for RetireOOO.

asavonic added inline comments.Feb 3 2021, 7:26 AM
llvm/include/llvm/MC/MCSchedule.h
120

Thanks. Uploaded a fix to https://reviews.llvm.org/D95954.

andreadb added a comment.EditedFeb 17 2021, 6:12 AM

Now that https://reviews.llvm.org/D95954 has been fixed, there are only a few things left to do:

  1. Disable the bottleneck analysis for in-order processors.
  2. Add a couple of lines to the release notes describing this new feature.
  3. Update the llvm-mca docs.

About the bottleneck analysis view:
The current implementation works under the assumption that the processor is out-of-order. The analysis internally observes events generated by the scheduler, as well as instruction state transitions. It correlates those events with scheduler queue availability to track pressure increases due to data dependencies etc. For in-order processors, we need completely different logic; the existing logic would not work (as you can tell from one of the tests you added, for which no bottlenecks were reported for data dependencies!).

We should raise a warning from the llvm-mca driver if the user explicitly requested the bottleneck analysis, and the processor is in-order. The warning should explain how the bottleneck analysis is currently unsupported for in-order processors. Note also that flag -all-views should not enable the bottleneck analysis if the simulated processor is in-order.

About the documentation: in a few paragraphs we mention "out-of-order" backends. In most cases, it is just a matter of removing the word "out-of-order", and the rest of the paragraph is still valid.
However, section "Instruction Flow" requires an extra paragraph for in-order processors. The instruction flow is much simpler for those processors, and the dispatch event is not decoupled from the issue event (it is all handled by your new "InOrderIssue" stage).
Please add a new section for your new "InOrderIssue" stage (ideally, it should go after the "Write-Back and Retire Stage" section). There you should briefly mention what happens to instructions during that stage. Please also add a small paragraph explaining how the current model allows for out-of-order execution of writes that are specially marked in the scheduling model.

I think this is all.
I plan to accept the patch once those three points are addressed.

Thanks,
-Andrea

Thanks for working on this, it looks like it will be very useful.

llvm/lib/MCA/Stages/InOrderIssueStage.cpp
82

How come this asserts on ReadAdvance < 0? I thought it was relatively common to have certain instructions requiring operands before the main stage the pipeline is based on.

andreadb added inline comments.Feb 19 2021, 3:10 AM
llvm/lib/MCA/Stages/InOrderIssueStage.cpp
82

I agree with Dave here. This logic must also work for the case where ReadAdvance is a negative value. There is no reason to restrict this logic to cases where ReadAdvance is >= 0.

You also don't need the assert for UNKNOWN_CYCLES at line 82. If the number of "write cycles left" for a register is unknown, then you can safely ignore it, and leave StallCycles as is (i.e. simply continue to the next iteration of the loop).

If however the number of write cycles left is different than UNKNOWN_CYCLES, then you should always update StallCycles with the std::max between the actual StallCycles and CyclesLeft - ReadAdvance. A negative ReadAdvance would effectively "increase" the CyclesLeft (which is what we want to model here).

For this to work, CyclesLeft must be manipulated as a signed quantity (note that UNKNOWN_CYCLES is also a signed quantity; its value is -1). std::max requires for operands to be of the same type, so this may require an extra cast for StallCycles.
You should then get rid of the check at line 84, and always compute the std::max, to update StallCycles.
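A minimal self-contained sketch of the computation Andrea describes, with CyclesLeft as a signed quantity so a negative ReadAdvance effectively increases it. The function name and signature are illustrative, not the actual llvm-mca code; in practice the caller would fold this over all register reads of the instruction:

```cpp
#include <algorithm>

constexpr int UNKNOWN_CYCLES = -1; // sentinel described above

// Update the stall count for one register read. Writes with an unknown
// number of cycles left are optimistically ignored, as suggested above.
unsigned updateStallCycles(unsigned StallCycles, int CyclesLeft,
                           int ReadAdvance) {
  if (CyclesLeft == UNKNOWN_CYCLES)
    return StallCycles;
  // A negative ReadAdvance increases the effective number of cycles left.
  int Needed = CyclesLeft - ReadAdvance;
  if (Needed < 0)
    Needed = 0; // the operand is already (or will be) available in time
  return std::max<int>(StallCycles, Needed);
}
```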

asavonic updated this revision to Diff 327409.Mar 2 2021, 4:37 AM

Supported UNKNOWN_CYCLES left and negative ReadAdvance.
Disabled bottleneck analysis for in-order CPUs.
Updated documentation.

Now that https://reviews.llvm.org/D95954 has been fixed, there are only a few things left to do:

  1. Disable the bottleneck analysis for in-order processors.
  2. Add a couple of lines to the release notes describing this new feature.
  3. Update the llvm-mca docs.

Thank you! Can you please check if wording in the docs is ok?

Regarding the bottleneck analysis: it seems to work for some cases
(A55-all-views.s), but not for others (A55-out-of-order-retire.s).
I will check what is wrong here. The feature is disabled for now as you
suggested.

llvm/lib/MCA/Stages/InOrderIssueStage.cpp
82

How come this asserts on ReadAdvance < 0? I thought it was relatively common to have certain instructions requiring operands before the main stage the pipeline is based on.

I wasn't sure that it was a valid case. Thank you for the explanation. I've
adjusted the code to handle negative ReadAdvance. There seem to be no such cases for the A55, so I added a test for the M7.

82

You also don't need the assert for UNKNOWN_CYCLES at line 82. If the number of "write cycles left" for a register is unknown, then you can safely ignore it, and leave StallCycles as is (i.e. simply continue to the next iteration of the loop).

I'm not sure that we can just ignore it. If it is unknown, then it is not zero,
so a register hazard should be detected. I've modified the code to return a 1-cycle
stall, so we can check that the value is "known" in the next cycle.

In any case, this should never happen, because Instruction::execute is called
before a subsequent instruction issue, and it always sets CyclesLeft to a
known value.

If however the number of write cycles left is different than UNKNOWN_CYCLES, then you should always update StallCycles with the std::max between the actual StallCycles and CyclesLeft - ReadAdvance. A negative ReadAdvance would effectively "increase" the CyclesLeft (which is what we want to model here).

For this to work, CyclesLeft must be manipulated as a signed quantity (note that UNKNOWN_CYCLES is also a signed quantity; its value is -1). std::max requires for operands to be of the same type, so this may require an extra cast for StallCycles.

Right, this is fixed now.

You should then get rid of the check at line 84, and always compute the std::max, to update StallCycles.

The check is still needed to ensure that we don't cast a negative value to
unsigned.

Thanks Andrew for the updated patch!

The doc changes look good to me.

The code changes also look good, except for one thing (I left a comment below).

About the bottleneck analysis:
The current analysis assumes an out-of-order pipeline. So I wouldn't be surprised if there are cases where it doesn't work well.
As you might already know, the bottleneck analysis effectively observes pressure events which are dynamically generated by a Scheduler component. Based on the observed trace, it predicts whether throughput was limited by a lack of processor resources and/or data dependencies.
The good thing about your patch is that your code already generates pressure events (from your new canExecute() method). A bottleneck analysis for in-order processors would be much simpler, since we don't need any complex analysis based on the knowledge of which instructions are in a PENDING/READY state in a scheduler buffer.

The best thing to do for now is to raise an llvm-mca bug about implementing a bottleneck analysis for in-order processors. You can raise it now, or wait until this patch is finally committed (either way is fine).

llvm/lib/MCA/Stages/InOrderIssueStage.cpp
84

As I wrote before, you should simply ignore the case where the number of cycles left is equal to UNKNOWN_CYCLES.
You shouldn't early exit with an arbitrary number of cycles.

Basically what I am saying is that you should replace this return statement with a continue;

andreadb added inline comments.Mar 2 2021, 5:48 AM
llvm/lib/MCA/Stages/InOrderIssueStage.cpp
82

Sorry, I have only read your comment now.

It is true that, strictly speaking, UNKNOWN_CYCLES isn't equivalent to zero cycles.
However, in practice we shouldn't have to deal with writes of UNKNOWN_CYCLES.
If I remember correctly, the out-of-order simulator optimistically ignores them.

If you really want to enforce at least a 1-cycle delay for it, then fair enough. However you shouldn't return immediately. Instead, you should simply update the stall cycles quantity and then continue to the next iteration of the loop.

asavonic added inline comments.Mar 2 2021, 6:04 AM
llvm/lib/MCA/Stages/InOrderIssueStage.cpp
84

As I wrote before, you should simply ignore the case where the number of cycles left is equal to UNKNOWN_CYCLES.
You shouldn't early exit with an arbitrary number of cycles.

Basically what I am saying is that you should replace this return statement with a continue;

I mentioned this above, but I might be missing something.

What exactly does UNKNOWN_CYCLES mean in this case? I assume that for an in-order
pipeline it means that an instruction is issued (since we issue in-order), but
we don't know when its write is going to be completed. If this is correct, then
we should not issue any instructions that depend on this write until we know that the write
is completed (and has CyclesLeft == 0).

return 1 here stalls the instruction for 1 cycle, and it will be checked again
in the next cycle. If we continue here instead, then the write is ignored.

asavonic updated this revision to Diff 327440.Mar 2 2021, 6:46 AM

Update StallCycles and continue instead of returning.

andreadb accepted this revision.Mar 2 2021, 7:30 AM

LGTM.

llvm/lib/MCA/Stages/InOrderIssueStage.cpp
84

OK, I did some archaeology (it has been a while since I wrote that code).

Long story short:
When simulating in-order processors, you should never end up in a situation where the number of cycles left is UNKNOWN_CYCLES. Your code was not wrong. Adding a continue is also OK (but not necessarily better). Ideally, you could just assert that the number of cycles is different than UNKNOWN_CYCLES. Up to you.

Long story:
Each register write is associated with a CyclesLeft quantity. The number of cycles left is unknown until the instruction is issued. Only at that point, we know exactly how many cycles are left before writes are committed to the register file.

On instruction issue, dependent instructions get notified, so that the CyclesLeft quantities associated with their register reads are also updated.

This is important when simulating an out-of-order backend, because the scheduler may contain a chain of dependent instructions, and not all the instructions of that chain are in a "ready" state.

The relevant code is in Instruction.cpp (see for example method WriteState::onInstructionIssued).

Back to our case:

For in-order processors, instructions are always issued in-order, and instructions are not buffered internally by a scheduler. So the "dispatch" and "issue" are basically the same event.

In the presence of data dependencies, instructions are artificially delayed until all inputs become available (i.e. inputs have a known number of cycles left). Only at that point, instructions are issued to the underlying execution units.

By construction, you cannot end up in a situation where the "current instruction" depends on an older instruction which hasn't started execution yet.

To put it another way: by construction, it is always guaranteed that older instructions have already started execution. Therefore, the CyclesLeft value for each write is always known.

This revision is now accepted and ready to land.Mar 2 2021, 7:30 AM
asavonic added inline comments.Mar 2 2021, 9:19 AM
llvm/lib/MCA/Stages/InOrderIssueStage.cpp
84

Thanks for the confirmation. This matches my analysis as well.
Instruction::execute always updates CyclesLeft, so we never have
UNKNOWN_CYCLES here. Let's keep the handling in case this changes at some point.

andreadb added inline comments.Mar 2 2021, 11:34 AM
llvm/lib/MCA/Stages/InOrderIssueStage.cpp
84

Sounds good.

Thanks for working on this feature!

-Andrea

This revision was landed with ongoing or failed builds.Mar 4 2021, 3:12 AM
This revision was automatically updated to reflect the committed changes.
foad added a subscriber: foad.Mar 9 2021, 4:05 AM