Delay the issue of a new instruction if that leads to out-of-order
commits of writes.
This patch fixes the problem described in:
https://bugs.llvm.org/show_bug.cgi?id=41796#c3
Differential D98604
[MCA] Ensure that writes occur in-order
Authored by asavonic on Mar 14 2021, 10:15 AM.
Comment Actions

Hi Andrew, thanks to your last comment, I think I now have a better idea of how this all should work.
OK. So an out-of-order commit doesn't necessarily imply out-of-order execution. Makes sense.

Back to your check:

```
if (LastIssuedInst &&
    !LastIssuedInst.getInstruction()->getDesc().RetireOOO) {
  // Delay the instruction to ensure that writes occur in program
  // order.
  if (unsigned StallWritesOrder = checkWritesOrder(LastIssuedInst, IR)) {
    *StallCycles = StallWritesOrder;
  }
}
```

Unless I've read the code wrongly (which is a possibility...), the check is only on the predecessor. According to that check, a RetireOOO instruction following an ADD still needs to check for any potential structural hazards. It means that we still want it to reach the write-back stage in order, even though, technically speaking, it is marked as RetireOOO. Is there a reason why you don't allow bypassing the write-back hazard check for all RetireOOO instructions?

NOTE: normally this may not be a problem in practice, since RetireOOO instructions (from what I have read) tend to have a high write latency. Still, it sounds a bit odd to me that we lift the check in one case but not the other.
Can you elaborate on why the ADD needs to wait for the completion of the FDIV? If the FDIV is allowed to retire out-of-order, then it can retire after the ADD, so the ADD doesn't have to wait.

So, the current semantics allow an ADD (or any other instruction following a RetireOOO instruction) to ignore the structural hazard check on the write-back. Example:

```
[0,0]     DeeeeeeeeeeeER  .    .    .    fdiv s1, s2, s3
[0,1]     DeeER.    .     .    .    .    add  w8, w8, #1
```

However, at least in theory, while the FDIV can commit out-of-order, the ADD is not "special" in any way. In normal circumstances, I would have expected the ADD to be subject to the "usual" structural hazard checks for the write-back.

Thankfully, your comment clarified this for me. You mentioned the following very important experiment:

> The model we have in MCA now seems to match the hardware (at least by IPC):

```
[0,0]     DeeeeeeeeeeeER   .    .    .   fdiv s1, s2, s3
[0,1]     DeeER.    .    .    .    .     add  w8, w8, #1
[0,2]     .DeeER    .    .    .    .     add  w1, w2, w0
[0,3]     .DeeER    .    .    .    .     add  w3, w4, #1
[0,4]     . DeeER   .    .    .    .     add  w5, w6, w0
[0,5]     . .  DeeeeeeeeeeeER   .        fdiv s4, s5, s6
[0,6]     . .  DeeER     .    .          add  w7, w9, w0
[0,7]     . .  DeeER     .    .          add  w3, w1, w3
[0,8]     . .  DeeER     .    .          add  w7, w2, w4
[0,9]     . .  DeeER.    .               add  w5, w5, w9
```

The above example is very important because it clearly contradicts my original assumption about the write-back checks.

Back to the code:

> Note that I had to fix the FDIV latency/resource-cycles in the LLVM scheduler model. Original values match the ARM documentation, but I get different timings just by running FDIV in a loop.

Unrelated to this patch. However, could it be that the FDIV latency somehow depends on the input values? It may be worth running the same micro-benchmark with a variety of different input values, to make sure that the execution units are not taking shortcuts. Just an idea.

Comment Actions

You're right, I missed that. The issue does not show up in the tests because of the high latency of FDIV. This is now fixed in the new revision of the patch.
Right, this was the intention of the experiment.

> Floating-point divisions and square roots (VDIV and VSQRT instructions) operate out-of-order. These instructions always complete in a single cycle, but there is a delay in the division or square root result being written to the register file. Subsequent instructions are allowed to issue, execute, and retire, provided they do not depend on the result of the VDIV or VSQRT, and they are not VDIV or VSQRT themselves. If a dependency is detected, the pipeline interlocks and waits for the availability of the result before continuing.

I think this confirms our findings. Although this part is a bit problematic: "Subsequent instructions are allowed to
Right, both points should be addressed in the new revision of the patch.
Documentation for Cortex-A55 does not mention this, and the manual for

Comment Actions

LGTM. Thanks a lot Andrew.

P.s.: once this is committed, we can resume the review on the other patch.

-Andrea
Yeah. This problem has to do with the presence of non-pipelined execution units. On Intel and AMD too, the divider is non-pipelined. So, when a division is issued, the divider becomes busy (and therefore unavailable) for a given number of cycles, until the division terminates. As you wrote, the way people normally address "issues" related to non-pipelined resources is by tweaking resource cycle consumption in the scheduling model.
Why not just store the number of cycles left?
You can easily compute that latency once, and then decrease it on every cycle (saturating at zero).
When you check for write-order hazards, you simply compare the longest write latency against that quantity. You won't need to recompute the cycles left from scratch starting from LastIssuedInst.