Page MenuHomePhabricator

[RISCV] MC relaxation for out-of-range conditional branch.
Needs ReviewPublic

Authored by HsiangKai on Aug 30 2021, 5:47 PM.

Details

Summary

If .L1 is not within +-4KiB range,

convert

bge a0, a1, .L1

to

blt a0, a1, 8
j   .L1

In this patch, if the symbol is unresolved at assembly time, do not do
this relaxation.

Fix the bug reported in https://bugs.llvm.org/show_bug.cgi?id=47910

Diff Detail

Event Timeline

HsiangKai created this revision.Aug 30 2021, 5:47 PM
HsiangKai requested review of this revision.Aug 30 2021, 5:47 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 30 2021, 5:47 PM
HsiangKai edited the summary of this revision. (Show Details)Aug 30 2021, 5:49 PM
HsiangKai edited the summary of this revision. (Show Details)

I know GNU as does this but I don't see why that's a good idea nor why we should; if you use a conditional branch to a label that's too far away then that's your problem and you should have written non-broken code. Having it insert the long branch form is too much magic for me...

asb added a comment.Sep 2 2021, 6:44 AM

I know GNU as does this but I don't see why that's a good idea nor why we should; if you use a conditional branch to a label that's too far away then that's your problem and you should have written non-broken code. Having it insert the long branch form is too much magic for me...

Yes, I'm in two minds about this. I'm usually in favour of trying to match GNU tools behaviour in order to minimise porting issues / provide a more predictable user experience. But fighting against that, I'm really not a fan of these "magic" transformations either.

The fear with adding anything like this is of course adding a very rarely executed path that ends up having bugs in obscure cases. As such, if we were to go in this direction I'd be keen to see more extensive test coverage.

jrtc27 added a comment.Sep 2 2021, 6:58 AM

I know GNU as does this but I don't see why that's a good idea nor why we should; if you use a conditional branch to a label that's too far away then that's your problem and you should have written non-broken code. Having it insert the long branch form is too much magic for me...

Yes, I'm in two minds about this. I'm usually in favour of trying to match GNU tools behaviour in order to minimise porting issues / provide a more predictable user experience. But fighting against that, I'm really not a fan of these "magic" transformations either.

The fear with adding anything like this is of course adding a very rarely executed path that ends up having bugs in obscure cases. As such, if we were to go in this direction I'd be keen to see more extensive test coverage.

FWIW, I *think* the origin of this stems from:

(a) GCC/binutils divide
(b) GCC uses pseudos in its output (even uses li in some cases, though often mixed in with other shifts and adds for decomposing immediates, bit odd)
(c) GCC has limited knowledge of compression

so GCC doesn't actually know how far away things are and thus can't pick the right branch all the time. Fixing it up in the assembler was presumably the easier solution rather than fixing GCC to emit correct assembly. That then has the unfortunate effect of becoming something hand-written assembly relies on (e.g. bbl was, as was FreeBSD in one place very early on before LLVM was mature enough to build it).

asb added a comment.Sep 2 2021, 8:27 AM

FWIW, I *think* the origin of this stems from:

(a) GCC/binutils divide
(b) GCC uses pseudos in its output (even uses li in some cases, though often mixed in with other shifts and adds for decomposing immediates, bit odd)
(c) GCC has limited knowledge of compression

so GCC doesn't actually know how far away things are and thus can't pick the right branch all the time. Fixing it up in the assembler was presumably the easier solution rather than fixing GCC to emit correct assembly. That then has the unfortunate effect of becoming something hand-written assembly relies on (e.g. bbl was, as was FreeBSD in one place very early on before LLVM was mature enough to build it).

That's a good point re the origin of this kind of magic. It probably strengthens the argument in my mind for supporting this - being able to take .s generated by GCC and assemble it with clang/LLVM is a good thing (and we should probably be doing more such testing).

kito-cheng added a comment.EditedSep 2 2021, 9:12 AM

FWIW, I *think* the origin of this stems from:

(a) GCC/binutils divide
(b) GCC uses pseudos in its output (even uses li in some cases, though often mixed in with other shifts and adds for decomposing immediates, bit odd)
(c) GCC has limited knowledge of compression

so GCC doesn't actually know how far away things are and thus can't pick the right branch all the time. Fixing it up in the assembler was presumably the easier solution rather than fixing GCC to emit correct assembly. That then has the unfortunate effect of becoming something hand-written assembly relies on (e.g. bbl was, as was FreeBSD in one place very early on before LLVM was mature enough to build it).

That's a good point re the origin of this kind of magic. It probably strengthens the argument in my mind for supporting this - being able to take .s generated by GCC and assemble it with clang/LLVM is a good thing (and we should probably be doing more such testing).

The description about GCC is not fully correct, GCC also has capability estimate jump range, and convert cond branch to inverted cond branch and a jump during the compilation stage, GCC didn't emit any compressed instruction so the instruction distance estimation is over-conservative.

So when we need this magic?

  1. GCC didn't have integrated assembler like LLVM, so instruction length for inline asm can only use a very roughly way to estimate.
  2. Mostly this is used for hand-written assembly file, RISC-V conditional branch only provide very short range compare to other RISC ISA; +-32K for MIPS, +-32M for ARM32, +-32K(TBZ/TBNZ) ~ +-1M(CBZ/CBNZ) for AArch64, but only +-2K for RISC-V, that made RISC-V is more easy to hit out-of-range condition branch issue, of cause we can ask programmer to convert this by themselves, but out-of-range error is reported until linker time because RISC-V has relaxation, that made many user confuse, (relocation truncate to fit???), this magic can prevent that confusion in most case.
jrtc27 added a comment.Sep 2 2021, 9:18 AM

FWIW, I *think* the origin of this stems from:

(a) GCC/binutils divide
(b) GCC uses pseudos in its output (even uses li in some cases, though often mixed in with other shifts and adds for decomposing immediates, bit odd)
(c) GCC has limited knowledge of compression

so GCC doesn't actually know how far away things are and thus can't pick the right branch all the time. Fixing it up in the assembler was presumably the easier solution rather than fixing GCC to emit correct assembly. That then has the unfortunate effect of becoming something hand-written assembly relies on (e.g. bbl was, as was FreeBSD in one place very early on before LLVM was mature enough to build it).

That's a good point re the origin of this kind of magic. It probably strengthens the argument in my mind for supporting this - being able to take .s generated by GCC and assemble it with clang/LLVM is a good thing (and we should probably be doing more such testing).

The description about GCC is not fully correct, GCC also has capability estimate jump range, and convert cond branch to inverted cond branch and a jump during the compilation stage, GCC didn't emit any compressed instruction so the instruction distance estimation is over-conservative.

So when we need this magic?

  1. GCC didn't have integrated assembler like LLVM, so instruction length for inline asm can only use a very roughly way to estimate.

Is that not a problem for every architecture though?

  1. Mostly this is used for hand-written assembly file, RISC-V conditional branch only provide very short range compare to other RISC ISA; +-32K for MIPS, +-32M for ARM32, +-32K(TBZ/TBNZ) ~ +-1M(CBZ/CBNZ) for AArch64, but only +-2K for RISC-V, that made RISC-V is more easy to hit out-of-range condition branch issue, of cause we can ask programmer to convert this by themselves, but out-of-range error is reported until linker time because RISC-V has relaxation, that made many user confuse, (relocation truncate to fit???), this magic can prevent that confusion in most case.

There's nothing stopping an assembler from detecting that a label will/won't/might be out of range for a given branch by doing best-case and worst-case analysis on the instruction sequence and emitting an error in the case where it's definitely going to be out of bounds however much you relax, and a warning when it's unsure because it depends on how much relaxation there is (which is likely to be relatively rare, normally these cases are _way_ out of bounds for user-provided assembly).

So when we need this magic?

  1. GCC didn't have integrated assembler like LLVM, so instruction length for inline asm can only use a very roughly way to estimate.

Is that not a problem for every architecture though?

Yeah, but RISC-V is more frequently hit due to shorter range for conditional branch, I believe integrated assembler is right way to resolve this, but sadly it's not existing in GCC :(

  1. Mostly this is used for hand-written assembly file, RISC-V conditional branch only provide very short range compare to other RISC ISA; +-32K for MIPS, +-32M for ARM32, +-32K(TBZ/TBNZ) ~ +-1M(CBZ/CBNZ) for AArch64, but only +-2K for RISC-V, that made RISC-V is more easy to hit out-of-range condition branch issue, of cause we can ask programmer to convert this by themselves, but out-of-range error is reported until linker time because RISC-V has relaxation, that made many user confuse, (relocation truncate to fit???), this magic can prevent that confusion in most case.

There's nothing stopping an assembler from detecting that a label will/won't/might be out of range for a given branch by doing best-case and worst-case analysis on the instruction sequence and emitting an error in the case where it's definitely going to be out of bounds however much you relax, and a warning when it's unsure because it depends on how much relaxation there is (which is likely to be relatively rare, normally these cases are _way_ out of bounds for user-provided assembly).

Technically that's feasible solution, but I think that's would be more like philosophical issue here :p

So when we need this magic?

  1. GCC didn't have integrated assembler like LLVM, so instruction length for inline asm can only use a very roughly way to estimate.

Is that not a problem for every architecture though?

Yeah, but RISC-V is more frequently hit due to shorter range for conditional branch, I believe integrated assembler is right way to resolve this, but sadly it's not existing in GCC :(

I don't particularly like adding magic to RISC-V just because it hits an issue shared by many architectures more often than them. If I can break every architecture other than RISC-V with a sufficiently-large blob of inline assembly then it seems to me like GCC needs fixing, not that RISC-V should have a special hacky workaround just for it that gets exposed to and relied upon by users?

  1. Mostly this is used for hand-written assembly file, RISC-V conditional branch only provide very short range compare to other RISC ISA; +-32K for MIPS, +-32M for ARM32, +-32K(TBZ/TBNZ) ~ +-1M(CBZ/CBNZ) for AArch64, but only +-2K for RISC-V, that made RISC-V is more easy to hit out-of-range condition branch issue, of cause we can ask programmer to convert this by themselves, but out-of-range error is reported until linker time because RISC-V has relaxation, that made many user confuse, (relocation truncate to fit???), this magic can prevent that confusion in most case.

There's nothing stopping an assembler from detecting that a label will/won't/might be out of range for a given branch by doing best-case and worst-case analysis on the instruction sequence and emitting an error in the case where it's definitely going to be out of bounds however much you relax, and a warning when it's unsure because it depends on how much relaxation there is (which is likely to be relatively rare, normally these cases are _way_ out of bounds for user-provided assembly).

Technically that's feasible solution, but I think that's would be more like philosophical issue here :p

MaskRay added a comment.EditedSep 2 2021, 10:28 AM

Some objection from me.

x86 does have such a relaxation (breaking WYSIWYG), but this is not a thing that any other RISC arch I know does.
WYSIWYG matters a bit because aligned branched for loops may have performance advantage. Unpredictable assembler relaxation can break it.
(On x86 when Intel jump condition code erratum was discussed) it is said certain instructions have "side effect after exactly one or two instruction", arbitrarily replacing one instruction has some risk. )
thumb1 has a short range (+-256) as well, but it doesn't use assembler relaxation:

# thumb: Error: branch out of range
.thumb
beq .Lfoo
.rept 129
nop
.endr
.Lfoo:

RISCV already uses lib/CodeGen/BranchRelaxation.cpp on the compiler side, so I am not sure why we need this assembler side solution.

https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html#Size-of-an-asm says
"It does this by counting the number of instructions in the pattern of the asm "
Multiple RISC ports use this. If GCC RISCV doesn't do this yet, it should be fixed.

For completeness, such long branches are typically not performance bottleneck.
Even if linker relaxation can shorten the distance a bit and allow more short-form branches,
it may not worth the complexity.

Will the BranchRelaxation pass account for .insn directives in inline asm correctly? Or will we need to do additional work for that associated with D108602

Will the BranchRelaxation pass account for .insn directives in inline asm correctly? Or will we need to do additional work for that associated with D108602

I haven't looked at it closely yet. BranchRelaxation.cpp uses TargetInstrInfo::getInstSizeInBytes to compute the size.

craig.topper edited the summary of this revision. (Show Details)Sep 2 2021, 10:42 AM

https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html#Size-of-an-asm says
"It does this by counting the number of instructions in the pattern of the asm "
Multiple RISC ports use this. If GCC RISCV doesn't do this yet, it should be fixed.

That's not target dependent feature, and it's roughly way to estimate, unless GCC has something like MC layer, otherwise it's never accurate.

For completeness, such long branches are typically not performance bottleneck.
Even if linker relaxation can shorten the distance a bit and allow more short-form branches,
it may not worth the complexity.

RISC-V don't have special relocation to handle this relaxation, and binutils didn't handle that, although that could be recognized by R_RISCV_BRANCH + R_RISCV_JAL pair, but as you said it's might be complicated and not worth.

It is not a new feature for RISC-V. We already have c.beqz to beq, c.bnez to bne, c.j to jal, etc. This patch just extends the relaxation patterns to conditional branches. If MC relaxation is not a good idea, should we remove all the patterns in the relaxInstruction target hook?

jrtc27 added a comment.Sep 6 2021, 6:45 PM

It is not a new feature for RISC-V. We already have c.beqz to beq, c.bnez to bne, c.j to jal, etc. This patch just extends the relaxation patterns to conditional branches. If MC relaxation is not a good idea, should we remove all the patterns in the relaxInstruction target hook?

My understanding is that's more a quirk of our implementation and is there to un-compress branches that get erroneously compressed (since a bare symbol is a valid operand for compressed branches), but yes, that does currently cause things to behave a bit strangely and ideally wouldn't be there.

binutils also does this for at least Xtensa http://sourceware.org/binutils/docs/as/Xtensa-Branch-Relaxation.html

There was this pull request to document the GNU behavior in the riscv-asm-manual https://github.com/riscv-non-isa/riscv-asm-manual/pull/58/files

The inline assembly counting in LLVM is known to be incomplete see https://bugs.llvm.org/show_bug.cgi?id=42539

I think maybe .insn will work with the current implementation of getInlineAsmLength because it mostly just counts the number of lines and multiplies by MaxInstLength. With a special case for .space directive.

binutils also does this for at least Xtensa http://sourceware.org/binutils/docs/as/Xtensa-Branch-Relaxation.html

There was this pull request to document the GNU behavior in the riscv-asm-manual https://github.com/riscv-non-isa/riscv-asm-manual/pull/58/files

The inline assembly counting in LLVM is known to be incomplete see https://bugs.llvm.org/show_bug.cgi?id=42539

I think maybe .insn will work with the current implementation of getInlineAsmLength because it mostly just counts the number of lines and multiplies by MaxInstLength. With a special case for .space directive.

If we deal with special directives such as .rept in getInlineAsmLength, we should be able to get the correct branch instruction according to the estimated offset. I agree that the compiler should be responsible to generate the correct code, not depends on MC to fix it. However, hand-written assembly code is another story. How about to turn MC relaxation off by default and let users turn on it if users want the compiler to help them to take care of these transformation?

I found some pseudo instruction are expanded to multiple MI instructions in RISCVExpandPseudoInsts.cpp. And the BranchRelaxation pass will run before the expandPseudo pass. Is it possible to cause the fixup value out-of-range on some branch instructions? If this will happen, maybe add MC layer branch relaxation can deal with it.

I found some pseudo instruction are expanded to multiple MI instructions in RISCVExpandPseudoInsts.cpp. And the BranchRelaxation pass will run before the expandPseudo pass. Is it possible to cause the fixup value out-of-range on some branch instructions? If this will happen, maybe add MC layer branch relaxation can deal with it.

Not unless RISCVInstrInfo::getInstSizeInBytes has missing or incorrect cases.

I found some pseudo instruction are expanded to multiple MI instructions in RISCVExpandPseudoInsts.cpp. And the BranchRelaxation pass will run before the expandPseudo pass. Is it possible to cause the fixup value out-of-range on some branch instructions? If this will happen, maybe add MC layer branch relaxation can deal with it.

Not unless RISCVInstrInfo::getInstSizeInBytes has missing or incorrect cases.

Got it. Thanks!