Delay the issue of a new instruction if that leads to out-of-order
commits of writes.
This patch fixes the problem described in:
https://bugs.llvm.org/show_bug.cgi?id=41796#c3
Differential D98604
[MCA] Ensure that writes occur in-order asavonic on Mar 14 2021, 10:15 AM. Authored by
Details Delay the issue of a new instruction if that leads to out-of-order This patch fixes the problem described in:
Diff Detail
Event Timeline
Comment Actions Hi Andrew, Thanks to your last comment, I think I know have a better idea of how this all should work.
OK. So out-of-order commit doesn't necessarily imply out-of-order execution. Makes sense. Back to your check: if (LastIssuedInst && !LastIssuedInst.getInstruction()->getDesc().RetireOOO) { // Delay the instruction to ensure that writes occur in program // order if (unsigned StallWritesOrder = checkWritesOrder(LastIssuedInst, IR)) { *StallCycles = StallWritesOrder; } } Unless I've read the code wrongly (which is a possibility...), then the check is only on the predecessor. According to that check, a RetiresOOO instruction following an ADD still needs to check for any potential structural hazards. It means that we still want it to reach the write-back stage in-order, even though - technically speaking - it is marked as RetiresOOO. Is there a reason why you don't allow the bypassing of the write-back hazard check for all RetiresOOO instructions? NOTE: normally this may not be a problem in practice, since RetiresOOO instructions (for what I have read) tend to have a high write latency. Still, it sounds a bit odd to me that we lift the check in one case, but not the other.
Can you elaborate why ADD needs to wait for the completion of FDIV? If FDIV is allowed to retire out-of-order, then it can retire after ADD, so ADD doesn't have to wait. So, the current semantic allows an ADD (or any other instruction following a RetireOOO instruction) to ignore the structural hazard check on the write-back. Example: [0,0] DeeeeeeeeeeeER . . . fdiv s1, s2, s3 [0,1] DeeER. . . . . add w8, w8, #1 However - at least in theory - while the FDIV can commit out-of-order, the ADD is not "special" in any way. In normal circumstances, I would have expected the ADD to be subject to the "usual" structural hazard checks for the write-back. Thankfully, your comment clarified this to me. You mentioned the following very important experiment: The model we have in MCA now seems to match the hardware (at least by IPC): [0,0] DeeeeeeeeeeeER . . . fdiv s1, s2, s3 [0,1] DeeER. . . . . add w8, w8, #1 [0,2] .DeeER . . . . add w1, w2, w0 [0,3] .DeeER . . . . add w3, w4, #1 [0,4] . DeeER . . . . add w5, w6, w0 [0,5] . . DeeeeeeeeeeeER . fdiv s4, s5, s6 [0,6] . . DeeER . . . add w7, w9, w0 [0,7] . . DeeER . . . add w3, w1, w3 [0,8] . . DeeER . . . add w7, w2, w4 [0,9] . . DeeER. . . add w5, w5, w9 The above example is very important because it clearly contradicts my original assumption on the write-back checks. Back to the code: Note that I had to fix the FDIV latency/resource-cycles in LLVM scheduler model. Original values match the ARM documentation, but I get different timings just by running FDIV in a loop. Unrelated to this patch. However, can it be that the FDIV latency is somehow dependent on the values passed in input? It may be worthy to run the same micro-benchmark with a variety of different values in input, to make sure that the execution units are not taking shortcuts. Just an idea. Comment Actions You're right, I missed that. The issue does not show up in the tests because of high latency of FDIV. This is now fixed in the new revision of the patch.
Right, this was the intention of the experiment. Floating-point divisions and square roots (VDIV and VSQRT instructions) operate out-of-order. These instructions always complete in a single cycle but there is a delay in the division or square root result being written to the register file. Subsequent instructions are allowed to issue, execute, and retire, provided they do not depend on the result of the VDIV, or VSQRT, and they are not VDIV or VSQRT themselves. If a dependency is detected, the pipeline interlocks and waits for the availability of the result before continuing. I think this confirms our findings. Although this part is a bit problematic: "Subsequent instructions are allowed to
Right, both points should be addressed in the new revision of the patch.
Documentation for Cortex-A55 does not mention this, and the manual for Comment Actions LGTM. Thanks a lot Andrew. P.s.: once this is committed, we can resume the review on the other patch. -Andrea
Yeah. This problem has to do with the presence of non-pipelined execution units. On Intel and AMD too, the divider is non-pipelined. So, when a division is issued, the divider gets busy (and therefore unavailable) for a given number of cycles, until the division terminates. As you wrote, the way how people normally address "issues" related to non-pipelined resources is by tweaking resource cycle consumptions in in the scheduling model. |
Why not just storing a number of cycles left?
You can easily compute that latency once, and then decrease it on every cycle (saturating to zero).
When you check for write-order hazards, you simply compare the longest write latency against that quantity. You won't need to recompute the cycles left from scratch starting from a LastIssuedInst.