This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Reverse the order of loading/storing callee-saved registers.
ClosedPublic

Authored by HsiangKai on Nov 16 2021, 12:24 AM.

Details

Summary

Currently, we restore the return address register with the last restore
instruction in the epilogue. The next instruction is usually ret, which
uses the return address register. On some microarchitectures there is a
load-to-use data hazard; to avoid it, we can separate the load
instruction from its use as far as possible. This patch reverses the
order in which callee-saved registers are loaded/stored, increasing the
distance between the load of ra and the ret in the epilogue.

Diff Detail

Event Timeline

HsiangKai created this revision.Nov 16 2021, 12:24 AM
HsiangKai requested review of this revision.Nov 16 2021, 12:24 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 16 2021, 12:24 AM

What's the benefit of reversing the spills? Seems unnecessary to me. GCC will emit spills and loads in the same order for RISC-V.

Though, I'm unconvinced about whether the hazard in the epilogue matters in practice. In a simple in-order microarchitecture, if restoring ra stalls then it doesn't matter where that happens; you can't execute out of order. In a superscalar core you'll just speculate past the ret anyway, which should predict accurately for anything with a half-decent RAS, though I guess you'll have unresolved speculation for longer, which might be an issue. Is there a particular microarchitecture you've measured an appreciable difference for this on?

> Though, I'm unconvinced about whether the hazard in the epilogue matters in practice. In a simple in-order microarchitecture, if restoring ra stalls then it doesn't matter where that happens; you can't execute out of order. In a superscalar core you'll just speculate past the ret anyway, which should predict accurately for anything with a half-decent RAS, though I guess you'll have unresolved speculation for longer, which might be an issue. Is there a particular microarchitecture you've measured an appreciable difference for this on?

What if there are two bubbles between the load instruction and the using instruction in the pipeline? In the current RISC-V LLVM implementation, there is only one instruction, i.e., the stack pointer adjustment, between the load of ra and the ret. The ret will cost an additional cycle for the bubble unless we schedule the epilogue to fill it. However, there is no scheduling of epilogue code in LLVM. The GCC implementation puts the load of ra first in the epilogue code. Here is an example: https://godbolt.org/z/3rnfrcnhT

I agree we could keep the prologue order of the current implementation. I reordered it only for symmetry; it is not necessary for this patch. The point is to reorder the epilogue restore code.

The intention of this patch is to have more instructions available to fill any bubbles between the load and its use. It increases the live range of ra, but that should have no negative impact in the epilogue code.

asb added a comment.Nov 17 2021, 7:07 AM

> Though, I'm unconvinced about whether the hazard in the epilogue matters in practice. In a simple in-order microarchitecture, if restoring ra stalls then it doesn't matter where that happens; you can't execute out of order. In a superscalar core you'll just speculate past the ret anyway, which should predict accurately for anything with a half-decent RAS, though I guess you'll have unresolved speculation for longer, which might be an issue. Is there a particular microarchitecture you've measured an appreciable difference for this on?

The target is probably cores using a microarchitecture similar to Rocket. It has a "non-blocking D-cache" where the integer pipeline won't stall upon a cache miss until the register value is actually read (provided there are sufficient MSHRs).

llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
440

Worth adding a comment to explain why list is being iterated in reverse order.

1045

Worth adding a comment to explain why list is being iterated in reverse order.

1075

I think the comment above needs updating now, and again it would be worth explaining the desired iteration order.

> Though, I'm unconvinced about whether the hazard in the epilogue matters in practice. In a simple in-order microarchitecture, if restoring ra stalls then it doesn't matter where that happens; you can't execute out of order. In a superscalar core you'll just speculate past the ret anyway, which should predict accurately for anything with a half-decent RAS, though I guess you'll have unresolved speculation for longer, which might be an issue. Is there a particular microarchitecture you've measured an appreciable difference for this on?

> The target is probably cores using a microarchitecture similar to Rocket. It has a "non-blocking D-cache" where the integer pipeline won't stall upon a cache miss until the register value is actually read (provided there are sufficient MSHRs).

Can it actually issue additional loads in the meantime, though, i.e. everything else that appears between the restore and the stack adjustment? I wouldn't expect so for an in-order core, which is why I asked what I did. I'm struggling to see a case where this would ever matter.

HsiangKai updated this revision to Diff 387966.Nov 17 2021, 9:30 AM

Address @asb's comments.

Additional context I got from Andrew Waterman

The ability to partially hide the latency of a cache miss using a non-blocking cache is more of a secondary concern.  The latency of a load hit is the parameter we're more sensitive to.

The problem manifests primarily in in-order processors, where the ret instruction will be stalled until its operands are available, hence speculative execution past the ret instruction will not occur.  Single-issue in-order processors with at least two cycles of load-use delay are likely to manifest this stall.  In-order superscalars are likely to manifest this stall even with smaller load-use delays.

While this is much less of an issue for out-of-order processors, this change has some value even there: the earlier an instruction's operands are available, the lower its occupancy in the issue window, reducing pressure on issue-window capacity.

An ASCII pipeline diagram demonstrates why the reordering can help.

Consider a CPU with a classic five-stage RISC pipeline:

IF: Instruction Fetch
ID: Instruction Decode
EXE: Instruction Execute
MEM: Memory Access
WB: Write Back

Assume memory reads take 2 cycles and there is NO forwarding.

lw s1, 4(sp) # 4-byte Folded Reload
lw s0, 8(sp) # 4-byte Folded Reload
lw ra, 12(sp) # 4-byte Folded Reload
addi sp, sp, 16
ret
| Cycle  | IF    | ID    | EXE   | MEM   | WB    | Comment |
|--------|-------|-------|-------|-------|-------|---------|
| N      | lw s1 |       |       |       |       | |
| N + 1  | lw s0 | lw s1 |       |       |       | |
| N + 2  | lw ra | lw s0 | lw s1 |       |       | |
| N + 3  | addi  | lw ra | lw s0 | lw s1 |       | Reading sp + 4 (1st cycle) |
| N + 4  | addi  | lw ra | lw s0 | lw s1 |       | Reading sp + 4 (stall for 2nd cycle) |
| N + 5  | ret   | addi  | lw ra | lw s0 | lw s1 | Reading sp + 8 (1st cycle) |
| N + 6  | ret   | addi  | lw ra | lw s0 |       | Reading sp + 8 (stall for 2nd cycle) |
| N + 7  |       | ret   | addi  | lw ra | lw s0 | Reading sp + 12 (1st cycle) |
| N + 8  |       | ret   | addi  | lw ra |       | Reading sp + 12 (stall for 2nd cycle) |
| N + 9  |       | ret   |       | addi  | lw ra | ret stalls in ID because ra isn't ready to read |
| N + 10 |       |       | ret   |       | addi  | |

...

lw ra, 12(sp) # 4-byte Folded Reload
lw s0, 8(sp) # 4-byte Folded Reload
lw s1, 4(sp) # 4-byte Folded Reload
addi sp, sp, 16
ret
| Cycle  | IF    | ID    | EXE   | MEM   | WB    | Comment |
|--------|-------|-------|-------|-------|-------|---------|
| N      | lw ra |       |       |       |       | |
| N + 1  | lw s0 | lw ra |       |       |       | |
| N + 2  | lw s1 | lw s0 | lw ra |       |       | |
| N + 3  | addi  | lw s1 | lw s0 | lw ra |       | Reading sp + 12 (1st cycle) |
| N + 4  | addi  | lw s1 | lw s0 | lw ra |       | Reading sp + 12 (stall for 2nd cycle) |
| N + 5  | ret   | addi  | lw s1 | lw s0 | lw ra | Reading sp + 8 (1st cycle) |
| N + 6  | ret   | addi  | lw s1 | lw s0 |       | Reading sp + 8 (stall for 2nd cycle) |
| N + 7  |       | ret   | addi  | lw s1 | lw s0 | Reading sp + 4 (1st cycle) |
| N + 8  |       | ret   | addi  | lw s1 |       | Reading sp + 4 (stall for 2nd cycle) |
| N + 9  |       |       | ret   | addi  | lw s1 | NO stall for ret! (ra was written back at N + 5) |

...

Sure, I realise that if you have an extremely basic pipeline without forwarding then it does matter. I just didn't think anyone seriously cared about such microarchitectures in this day and age, or if they did then they at least didn't expect any kind of half-decent performance out of them. Not against this change, just surprised it's ever particularly noticeable.

In-order superscalar is a good point though, and the FU740 is precisely such a pipeline where you can issue both a load and a branch in the same cycle, so I guess the current codegen is hurting a noticeable amount there.

There are still a lot of low-end RISC-V CPUs used in the embedded world that might use such a microarchitecture and benefit from this codegen change, as far as I know :p

> In-order superscalar is a good point though, and the FU740 is precisely such a pipeline where you can issue both a load and a branch in the same cycle, so I guess the current codegen is hurting a noticeable amount there.

Yeah, in-order superscalar is the case. We found a performance regression on the U74 compared with GCC codegen.

Just some comment nits, otherwise ready to land IMO

llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1070

Missing "the" before "return address"

1071

I wouldn't talk about filling bubbles, there may not be any anyway. Just say it avoids potential load-to-use data hazards, as in the commit message.

1072

RISC-V registers don't have a $ prefix.

HsiangKai updated this revision to Diff 388787.Nov 21 2021, 6:56 PM

Address @jrtc27's comments.

asb accepted this revision.Nov 22 2021, 4:02 AM

LGTM, thanks!

This revision is now accepted and ready to land.Nov 22 2021, 4:02 AM