Currently, we restore the return address register as the last restoring
instruction in the epilog. The next instruction is ret usually. It is
a use of return address register. In some microarchitectures, there is
load-to-use data hazard. To avoid the load-to-use data hazard, we could
separate the load instruction from its use as far as possible. In this
patch, we reverse the order of loading/storing callee-saved registers to
increase the distance of load ra and ret in the epilog.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
What's the benefit of reversing the spills? Seems unnecessary to me. GCC will emit spills and loads in the same order for RISC-V.
Though, I'm unconvinced about whether the hazard in the epilogue matters in practice. In a simple in-order micoarchitecture if restoring ra stalls then it doesn't matter where that is, you can't execute out-of-order. In a superscalar core you'll just speculate past the ret anyway, which should predict accurately for anything with a half-decent RAS, though I guess you'll have unresolved speculation for longer that might be an issue. Is there a particular microarchitecture you've measured an appreciable difference for this on?
What if there are two bubbles between load instruction and the using instruction in the pipeline? In the current RISC-V LLVM implementation, there is only one instruction, i.e., stack pointer adjustment, between load ra and ret. The ret will cost an additional cycle for the bubble unless we schedule the epilogue to fill the bubble. However, there is no scheduling for epilogue code in LLVM. In GCC implementation, it will put the load ra at first in the epilogue code. Here is an example. https://godbolt.org/z/3rnfrcnhT
I agree we could keep the order of prologue in the current implementation. I reorder them just for symmetry. It is not necessary in this patch. The point is to reorder the epilogue restoring code.
The intention of this patch is to have more instructions to fill the bubbles if there are any bubbles between load and use. It will increase the live range of ra, but it should have no negative impact in the epilogue code.
The target is probably cores using a microarchitecture similar to Rocket. It has a "non-blocking D-cache" where the integer pipeline won't stall upon a cache miss until the register value is actually read (provided there are sufficient MSHRs).
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp | ||
---|---|---|
440 | Worth adding a comment to explain why list is being iterated in reverse order. | |
1045 | Worth adding a comment to explain why list is being iterated in reverse order. | |
1075 | I think the comment above needs updating now, and again it would be worth explaining the desired iteration order. |
Can it actually issue additional loads in the meantime though, i.e. everything else that appears in between the restore and the stack adjustment? I wouldn't expect so for an in-order core, which is why I asked what I did. I'm struggling to see a case where this would ever matter.
Additional context I got from Andrew Waterman
The ability to partially hide the latency of a cache miss using a non-blocking cache is more of a secondary concern. The latency of a load hit is the parameter we're more sensitive to. The problem manifests primarily in in-order processors, where the ret instruction will be stalled until its operands are available, hence speculative execution past the ret instruction will not occur. Single-issue in-order processors with at least two cycles of load-use delay are likely to manifest this stall. In-order superscalars are likely to manifest this stall even with smaller load-use delays. While this is much less of an issue for out-of-order processors, this change has some value even there: the earlier an instructions' operands are available, the lower its occupancy in the issue window, reducing pressure on issue-window capacity.
An ASCII pipeline diagram to demonstrate why reorder that could help:
Consider a typical RISC 5 stage pipeline CPU:
IF: Instruction Fetch
ID: Instruction Decode
EXE: Instruction Execute
MEM: Memory Access
WB: Write Back
Assume we need 2 cycle to read memory and NO forwording.
lw s1, 4(sp) # 4-byte Folded Reload lw s0, 8(sp) # 4-byte Folded Reload lw ra, 12(sp) # 4-byte Folded Reload addi sp, sp, 16 ret
Cycle | IF | ID | EXE | MEM | WB | Comment |
N | lw s1 | |||||
N + 1 | lw s0 | lw s1 | ||||
N + 2 | lw ra | lw s0 | lw s1 | |||
N + 3 | addi | lw ra | lw s0 | lw s1 | Reading sp + 4 (1st cycle) | |
N + 4 | addi | lw ra | lw s0 | lw s1 | Reading sp + 4 (stall for 2nd cycle) | |
N + 5 | ret | addi | lw ra | lw s0 | lw s1 | Reading sp + 8 (1st cycle) |
N + 6 | ret | addi | lw ra | lw s0 | Reading sp + 8 (stall for 2st cycle) | |
N + 7 | ret | addi | lw ra | lw s0 | Reading sp + 12 (1st cycle) | |
N + 8 | ret | addi | lw ra | Reading sp + 12 (stall for 2st cycle) | ||
N + 9 | ret | addi | lw ra | ret stall due to ra isn't ready to read. | ||
N + 10 | ret | addi | ||||
...
lw ra, 12(sp) # 4-byte Folded Reload lw s0, 8(sp) # 4-byte Folded Reload lw s1, 4(sp) # 4-byte Folded Reload addi sp, sp, 16 ret
Cycle | IF | ID | EXE | MEM | WB | Comment |
N | lw ra | |||||
N + 1 | lw s0 | lw ra | ||||
N + 2 | lw s1 | lw s0 | lw ra | |||
N + 3 | addi | lw s1 | lw s0 | lw ra | Reading sp + 12 (1st cycle) | |
N + 4 | addi | lw s1 | lw s0 | lw ra | Reading sp + 12 (stall for 2nd cycle) | |
N + 5 | ret | addi | lw s1 | lw s0 | lw ra | Reading sp + 8 (1st cycle) |
N + 6 | ret | addi | lw s1 | lw s0 | Reading sp + 8 (stall for 2st cycle) | |
N + 7 | ret | addi | lw s1 | lw s0 | Reading sp + 12 (1st cycle) | |
N + 8 | ret | addi | lw s1 | Reading sp + 12 (stall for 1st cycle) | ||
N + 9 | ret | addi | lw s1 | NO stall for ret! | ||
...
Sure, I realise that if you have an extremely basic pipeline without forwarding then it does matter. I just didn't think anyone seriously cared about such microarchitectures in this day and age, or if they did then they at least didn't expect any kind of half-decent performance out of them. Not against this change, just surprised it's ever particularly noticeable.
In-order superscalar is a good point though, and the FU740 is precisely such a pipeline where you can issue both a load and a branch in the same cycle, so I guess the current codegen is hurting a noticeable amount there.
There is still lot of low-end RISC-V CPU used in embedded world, which might use such u-arch and benefit for such code gen change as I know :p
Yeah, in-order superscalar is the case. We found the performance downgrade in U74 compared with GCC codegen.
Just some comment nits, otherwise ready to land IMO
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp | ||
---|---|---|
1070 | Missing "the" before "return address" | |
1071 | I wouldn't talk about filling bubbles, there may not be any anyway. Just say it avoids potential load-to-use data hazards, as in the commit message. | |
1072 | RISC-V registers don't have a $ prefix. |
Worth adding a comment to explain why list is being iterated in reverse order.