This is an archive of the discontinued LLVM Phabricator instance.

[MCA] Ensure that writes occur in-order
ClosedPublic

Authored by asavonic on Mar 14 2021, 10:15 AM.

Details

Summary

Delay the issue of a new instruction if issuing it would lead to out-of-order
commits of writes.

This patch fixes the problem described in:
https://bugs.llvm.org/show_bug.cgi?id=41796#c3

Diff Detail

Event Timeline

asavonic created this revision.Mar 14 2021, 10:15 AM
asavonic requested review of this revision.Mar 14 2021, 10:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 14 2021, 10:15 AM
foad added a subscriber: foad.Mar 14 2021, 12:16 PM
andreadb added inline comments.Mar 15 2021, 4:33 AM
llvm/include/llvm/MCA/Stages/InOrderIssueStage.h
41

Why not just store the number of cycles left?

You can easily compute that latency once, and then decrease it on every cycle (saturating to zero).

When you check for write-order hazards, you simply compare the longest write latency against that quantity. You won't need to recompute the cycles left from scratch starting from a LastIssuedInst.

NOTE: it should not always reference the "last issued instruction". In case of OoO execution (see the comment below) the slowest write may not have been contributed by the last issued instruction. That quantity should always be the MAX write latency in general.
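
A minimal sketch of the suggested bookkeeping, with hypothetical names (this is not the actual MCA code): keep a saturating countdown of the cycles still needed by the slowest in-flight write, and derive the stall for a new instruction from it.

#include <algorithm>

// Hypothetical sketch: track how many cycles remain until the slowest
// in-flight write retires, instead of keeping a reference to LastIssuedInst.
struct WriteCyclesLeft {
  unsigned CyclesLeft = 0;

  // On issue, record the longest write latency of the new instruction.
  // Taking the max ties the counter to the slowest write overall, not just
  // to the last issued instruction.
  void onIssue(unsigned MaxWriteLatency) {
    CyclesLeft = std::max(CyclesLeft, MaxWriteLatency);
  }

  // Once per simulated cycle, saturating at zero.
  void onCycleEnd() {
    if (CyclesLeft)
      --CyclesLeft;
  }

  // A write-order hazard exists if the candidate would write back before the
  // slowest in-flight write; the difference is the number of stall cycles.
  unsigned stallCycles(unsigned CandidateMaxWriteLatency) const {
    return CandidateMaxWriteLatency >= CyclesLeft
               ? 0
               : CyclesLeft - CandidateMaxWriteLatency;
  }
};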
llvm/lib/MCA/Stages/InOrderIssueStage.cpp
152–158

There is still a bit of confusion around the meaning of the RetireOOO flag, and what it actually means in practice for llvm-mca.

Over the last few days, I did some searches on "out-of-order commit" and what it actually means in practice. It is not to be confused with "out-of-order execution". Our RCU knows how to commit "in-order", but it doesn't know how to simulate "out-of-order" commit.

I don't have access to a cortex-a55, so I cannot run experiments on it.
However, I think that for now it is safe to just assume that RetireOOO means: out-of-order execution, plus out-of-order commit.

For out-of-order commit, we may not need to simulate an RCU. So I think your change on the RCU tokens is fine.

The question however is: what happens if you alternate instructions with in-order/out-of-order execution?
Example: FDIV, ADD, FDIV, ADD, ...

Your LastIssuedInst currently points to the instruction that reaches write-back stage last. However, you still have to account for the FDIV write latency when checking if the following ADD canIssue. ADDs must still be issued in-order. So it needs to wait for the completion of FDIV.

RetireOOO would become just a way to bypass the checkWritesOrder check.

Could you also add a test for sequences of {FDIV,ADD} pairs?

llvm/test/tools/llvm-mca/AArch64/Cortex/A55-in-order-retire.s
106–108

tl;dr: not an issue introduced by this patch. However, in the future we may need to restrict the number of read/write ports available, to prevent too many instructions from updating the PRF at the same time.

All those instructions will reach write-back stage at the same time. I don't know how many write ports there are in the PRF of this processor. If it allows up to 2 writes per cycle, then the issue of the last instruction should be artificially delayed by one cycle.

More generally, limits on the number of read/write ports in the register file are not currently modelled. It would be nice to add support for those in the future (assuming that this may become an important source of inaccuracy, especially when simulating in-order processors).
The register file descriptor in the "extra processor info" struct is mainly a way to:

  • limit the amount of phys registers for renaming
  • enable/disable move elimination
  • enable/disable the matching of zero-idioms.

None of that is probably useful for in-order processors. We may instead want to add extra fields for the number of read/write ports, so that, when we check whether an instruction can execute, we check the availability of read ports on issue and the availability of write ports on write-back; otherwise we stall the instruction.

This would be a future development.
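
A rough sketch of what such per-cycle port limits could look like (hypothetical names and fields, not something that exists in MCA today):

// Hypothetical sketch: a per-cycle budget of PRF read/write ports. An
// instruction stalls if the ports it needs are not available in the cycle
// where it reads its operands (issue) or writes its results (write-back).
struct RegisterFilePortBudget {
  unsigned ReadPortsPerCycle = 0;
  unsigned WritePortsPerCycle = 0;
  unsigned ReadsUsedThisCycle = 0;
  unsigned WritesUsedThisCycle = 0;

  bool canRead(unsigned NumReads) const {
    return ReadsUsedThisCycle + NumReads <= ReadPortsPerCycle;
  }
  bool canWrite(unsigned NumWrites) const {
    return WritesUsedThisCycle + NumWrites <= WritePortsPerCycle;
  }
  void consumeReads(unsigned NumReads) { ReadsUsedThisCycle += NumReads; }
  void consumeWrites(unsigned NumWrites) { WritesUsedThisCycle += NumWrites; }
  void onCycleEnd() { ReadsUsedThisCycle = WritesUsedThisCycle = 0; }
};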

How many write ports are there in the register file?

asavonic updated this revision to Diff 330611.Mar 15 2021, 4:52 AM
asavonic edited the summary of this revision. (Show Details)

Disable RCU in a separate patch: D98628 [MCA] Disable RCU for InOrderIssueStage

asavonic added inline comments.Mar 15 2021, 11:15 AM
llvm/lib/MCA/Stages/InOrderIssueStage.cpp
152–158

There is still a bit of confusion around the meaning of the RetireOOO flag, and what it actually means in practice for llvm-mca.

Over the last few days, I did some searches on "out-of-order commit" and what it actually means in practice. It is not to be confused with "out-of-order execution". Our RCU knows how to commit "in-order", but it doesn't know how to simulate "out-of-order" commit.

I think the intention is to allow some instructions to retire naturally, even if the order of retire does not match the program order.
In that case, RCU can do nothing and allow an instruction to retire as soon as it finishes execution.

I don't have access to a cortex-a55, so I cannot run experiments on it.
However, I think that for now it is safe to just assume that RetireOOO means: out-of-order execution, plus out-of-order commit.

For out-of-order commit, we may not need to simulate an RCU. So I think your change on the RCU tokens is fine.

The question however is: what happens if you alternate instructions with in-order/out-of-order execution?
Example: FDIV, ADD, FDIV, ADD, ...

I don't think that Cortex-A55 can execute instructions out-of-order. Some FP instructions can commit/retire out-of-order,
but they use a different set of registers and their timing characteristics are aligned (according to the documentation).

The model we have in MCA now seems to match the hardware (at least by IPC):

[0,0]     DeeeeeeeeeeeER .    .    .   fdiv	s1, s2, s3
[0,1]     DeeER.    .    .    .    .   add	w8, w8, #1
[0,2]     .DeeER    .    .    .    .   add	w1, w2, w0
[0,3]     .DeeER    .    .    .    .   add	w3, w4, #1
[0,4]     . DeeER   .    .    .    .   add	w5, w6, w0
[0,5]     .    .  DeeeeeeeeeeeER   .   fdiv	s4, s5, s6
[0,6]     .    .  DeeER  .    .    .   add	w7, w9, w0
[0,7]     .    .   DeeER .    .    .   add	w3, w1, w3
[0,8]     .    .   DeeER .    .    .   add	w7, w2, w4
[0,9]     .    .    DeeER.    .    .   add	w5, w5, w9

Note that I had to fix the FDIV latency/resource-cycles in the LLVM scheduler model.
The original values match the ARM documentation, but I get different timings just by running FDIV in a loop.

The following sequence also matches the hardware:

[0,0]     DeeeeeeeeeeeER .    .    .    .    .   fdiv       s1, s2, s3
[0,1]     DeeER.    .    .    .    .    .    .   add        w8, w8, #1
[0,2]     .    .    . DeeeeeeeeeeeER    .    .   fdiv       s4, s2, s1
[0,3]     .    .    . DeeER   .    .    .    .   add        w1, w8, #1
[1,0]     .    .    .    .    DeeeeeeeeeeeER .   fdiv       s1, s2, s3
[1,1]     .    .    .    .    DeeER.    .    .   add        w8, w8, #1

Your LastIssuedInst currently points to the instruction that reaches write-back stage last. However, you still have to account for the FDIV write latency when checking if the following ADD canIssue. ADDs must still be issued in-order. So it needs to wait for the completion of FDIV.

RetireOOO would become just a way to bypass the checkWritesOrder check.

Can you elaborate why ADD needs to wait for the completion of FDIV?
If FDIV is allowed to retire out-of-order, then it can retire after ADD, so ADD doesn't have to wait.

Could you also add a test for sequences of {FDIV,ADD} pairs?

There is an A55-out-of-order-retire.s test with an FDIV and ADD sequence.
I can add the second example above as a LIT test as well.

Hi Andrew,

Thanks to your last comment, I think I now have a better idea of how this all should work.
I still have some questions though (see below).

I don't think that Cortex-A55 can execute instructions out-of-order. Some FP instructions can commit/retire out-of-order,
but they use a different set of registers and their timing characteristics are aligned (according to the documentation).

OK. So out-of-order commit doesn't necessarily imply out-of-order execution. Makes sense.

Back to your check:

if (LastIssuedInst && !LastIssuedInst.getInstruction()->getDesc().RetireOOO) {
  // Delay the instruction to ensure that writes occur in program
  // order
  if (unsigned StallWritesOrder = checkWritesOrder(LastIssuedInst, IR)) {
    *StallCycles = StallWritesOrder;
  }
}

Unless I've read the code wrongly (which is a possibility...), the check is only on the predecessor.
This works for the example where an ADD follows an FDIV. However, it doesn't work if the FDIV follows the ADD. Is this expected?

According to that check, a RetireOOO instruction following an ADD still needs to check for any potential structural hazards. It means that we still want it to reach the write-back stage in-order, even though - technically speaking - it is marked as RetireOOO.

Is there a reason why you don't allow the bypassing of the write-back hazard check for all RetireOOO instructions?

NOTE: normally this may not be a problem in practice, since RetireOOO instructions (from what I have read) tend to have a high write latency. Still, it sounds a bit odd to me that we lift the check in one case, but not the other.
Can you elaborate why ADD needs to wait for the completion of FDIV?
If FDIV is allowed to retire out-of-order, then it can retire after ADD, so ADD doesn't have to wait.

So, the current semantics allow an ADD (or any other instruction following a RetireOOO instruction) to ignore the structural hazard check on the write-back.

Example:

[0,0]     DeeeeeeeeeeeER .    .    .   fdiv	s1, s2, s3
[0,1]     DeeER.    .    .    .    .   add	w8, w8, #1

However - at least in theory - while the FDIV can commit out-of-order, the ADD is not "special" in any way. In normal circumstances, I would have expected the ADD to be subject to the "usual" structural hazard checks for the write-back.

Thankfully, your comment clarified this for me. You mentioned the following very important experiment:

The model we have in MCA now seems to match the hardware (at least by IPC):

[0,0]     DeeeeeeeeeeeER .    .    .   fdiv	s1, s2, s3
[0,1]     DeeER.    .    .    .    .   add	w8, w8, #1
[0,2]     .DeeER    .    .    .    .   add	w1, w2, w0
[0,3]     .DeeER    .    .    .    .   add	w3, w4, #1
[0,4]     . DeeER   .    .    .    .   add	w5, w6, w0
[0,5]     .    .  DeeeeeeeeeeeER   .   fdiv	s4, s5, s6
[0,6]     .    .  DeeER  .    .    .   add	w7, w9, w0
[0,7]     .    .   DeeER .    .    .   add	w3, w1, w3
[0,8]     .    .   DeeER .    .    .   add	w7, w2, w4
[0,9]     .    .    DeeER.    .    .   add	w5, w5, w9

The above example is very important because it clearly contradicts my original assumption on the write-back checks.
If the write-back check was still performed, then the ADD would be in the critical path, and the block latency would be proportional to the length of the ADDs tail.
However, your comment suggests that benchmarks report a smaller latency, and therefore the ADD sequence latency is hidden by the FDIV latency, meaning that those ADDs are not on the critical path. It should become even more obvious if you make that tail of ADDs even longer (but still small enough to be fully hidden by the FDIV latency).

Back to the code:
I still think that you only need a latency value instead of storing LastIssuedInst. Also, shouldn't LastIssuedInst always point to the last non-RetireOOO instruction?

Note that I had to fix the FDIV latency/resource-cycles in the LLVM scheduler model.
The original values match the ARM documentation, but I get different timings just by running FDIV in a loop.

Unrelated to this patch. However, can it be that the FDIV latency somehow depends on the input values? It may be worthwhile to run the same micro-benchmark with a variety of different input values, to make sure that the execution units are not taking shortcuts. Just an idea.

Is there a reason why you don't allow the bypassing of the write-back hazard check for all RetireOOO instructions?

NOTE: normally this may not be a problem in practice, since RetireOOO instructions (from what I have read) tend to have a high write latency. Still, it sounds a bit odd to me that we lift the check in one case, but not the other.

You're right, I missed that. The issue does not show up in the tests because of the high latency of FDIV. This is now fixed in the new revision of the patch.
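
For reference, the symmetric form of the check, reusing the names from the snippet quoted earlier, would look roughly like the following sketch (the actual code in the updated diff may differ):

if (LastIssuedInst && !IR.getInstruction()->getDesc().RetireOOO &&
    !LastIssuedInst.getInstruction()->getDesc().RetireOOO) {
  // Delay the instruction to ensure that writes occur in program order.
  // The hazard check is skipped when either the candidate or the previously
  // issued instruction is allowed to retire out-of-order.
  if (unsigned StallWritesOrder = checkWritesOrder(LastIssuedInst, IR))
    *StallCycles = StallWritesOrder;
}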

The above example is very important because it clearly contradicts my original assumption on the write-back checks.
If the write-back check was still performed, then the ADD would be in the critical path, and the block latency would be proportional to the length of the ADDs tail.
However, your comment suggests that benchmarks report a smaller latency, and therefore the ADD sequence latency is hidden by the FDIV latency, meaning that those ADDs are not on the critical path. It should become even more obvious if you make that tail of ADDs even longer (but still small enough to be fully hidden by the FDIV latency).

Right, this was the intention of the experiment.
I also found that the same feature is described differently in the Cortex-R52 manual:

Floating-point divisions and square roots (VDIV and VSQRT instructions)
operate out-of-order. These instructions always complete in a single cycle
but there is a delay in the division or square root result being written to
the register file. Subsequent instructions are allowed to issue, execute,
and retire, provided they do not depend on the result of the VDIV, or VSQRT,
and they are not VDIV or VSQRT themselves. If a dependency is detected, the
pipeline interlocks and waits for the availability of the result before
continuing.

I think this confirms our findings.

Although this part is a bit problematic: "Subsequent instructions are allowed to
issue, execute, and retire, provided they do not depend on the result of the
VDIV, or VSQRT, and they are not VDIV or VSQRT themselves". MCA does not handle
this, but this should never happen (at least for A55), because the FDIV/VSQRT
instructions go through the same FP-pipeline, and take ResourceCycles
according to their latency.

Back to the code:
I still think that you only need a latency value instead of storing LastIssuedInst. Also, shouldn't LastIssuedInst always point to the last non-RetireOOO instruction?

Right, both points should be addressed in the new revision of the patch.

Note that I had to fix the FDIV latency/resource-cycles in the LLVM scheduler model.
The original values match the ARM documentation, but I get different timings just by running FDIV in a loop.

Unrelated to this patch. However, can it be that the FDIV latency somehow depends on the input values? It may be worthwhile to run the same micro-benchmark with a variety of different input values, to make sure that the execution units are not taking shortcuts. Just an idea.

Documentation for Cortex-A55 does not mention this, and the manual for
Cortex-R52 states that latencies for FDIV are data-independent. But that is a
good point. I should set all registers to specific values, and then change them
to see the difference.

asavonic updated this revision to Diff 331088.Mar 16 2021, 1:26 PM

Store LastWriteBackCycle instead of recomputing it.
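
A minimal sketch of that bookkeeping, with hypothetical names (not the literal code of the diff): remember the cycle at which the last in-order write retires, and stall a new instruction just enough so that its own write-back does not happen earlier.

#include <algorithm>

// Hypothetical sketch: track the write-back cycle of the last instruction
// whose writes must commit in order, and derive the stall for a candidate.
struct InOrderWriteBackTracker {
  unsigned LastWriteBackCycle = 0;

  // Stall the candidate until its write-back would not precede the last
  // in-order write-back. RetireOOO instructions are exempt from the check.
  unsigned stallCycles(unsigned CurrentCycle, unsigned MaxWriteLatency,
                       bool RetiresOOO) const {
    if (RetiresOOO)
      return 0;
    unsigned WriteBackCycle = CurrentCycle + MaxWriteLatency;
    return WriteBackCycle >= LastWriteBackCycle
               ? 0
               : LastWriteBackCycle - WriteBackCycle;
  }

  // Only non-RetireOOO instructions advance the in-order write-back point.
  void onIssue(unsigned CurrentCycle, unsigned MaxWriteLatency,
               bool RetiresOOO) {
    if (!RetiresOOO)
      LastWriteBackCycle =
          std::max(LastWriteBackCycle, CurrentCycle + MaxWriteLatency);
  }
};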

andreadb accepted this revision.Mar 17 2021, 3:57 AM

LGTM.

Thanks a lot Andrew.

P.s.: once this is committed, we can resume the review on the other patch.

-Andrea

Although this part is a bit problematic: "Subsequent instructions are allowed to
issue, execute, and retire, provided they do not depend on the result of the
VDIV, or VSQRT, and they are not VDIV or VSQRT themselves". MCA does not handle
this, but this should never happen (at least for A55), because the FDIV/VSQRT
instructions go through the same FP-pipeline, and take ResourceCycles
according to their latency.

Yeah. This problem has to do with the presence of non-pipelined execution units. On Intel and AMD too, the divider is non-pipelined. So, when a division is issued, the divider gets busy (and therefore unavailable) for a given number of cycles, until the division terminates. As you wrote, the way people normally address "issues" related to non-pipelined resources is by tweaking resource cycle consumptions in the scheduling model.
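
As a minimal illustration of that mechanism (hypothetical names, not MCA's actual resource manager): a non-pipelined unit simply stays busy for the full resource-cycle count of the operation it is executing.

// Hypothetical sketch of a non-pipelined unit such as a divider: while an
// operation is in flight, the unit is unavailable, so a second division has
// to wait until the first one releases it.
struct NonPipelinedUnit {
  unsigned BusyCycles = 0;

  bool isAvailable() const { return BusyCycles == 0; }

  // ResourceCycles models how long the unit is occupied by one operation;
  // for a non-pipelined divider this is typically set close to its latency.
  void onIssue(unsigned ResourceCycles) { BusyCycles = ResourceCycles; }

  void onCycleEnd() {
    if (BusyCycles)
      --BusyCycles;
  }
};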

This revision is now accepted and ready to land.Mar 17 2021, 3:57 AM
This revision was landed with ongoing or failed builds.Mar 18 2021, 7:11 AM
This revision was automatically updated to reflect the committed changes.