The short jump/call costs fewer cycles than the long jump/call.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
lld/ELF/Arch/AVR.cpp | ||
---|---|---|
245 | Is there a better way to remove this nop? The only approach I can think of is a memcpy on `uint8_t *loc`. |
lld/ELF/Arch/AVR.cpp | ||
---|---|---|
245 | I implemented this for RISC-V, but I am unsure we should take on the complexity for the less-used AVR port. This complexity is exactly what I called out in a previous patch, and you said that you did not intend to implement it. |
lld/ELF/Arch/AVR.cpp | ||
---|---|---|
245 | I see. I will not pursue removing the NOP any further. I think the current form (long jump -> short jump + nop) is simple enough, and at least 1 cycle is saved. Hopefully you will accept it. ^_^ |
Apologies but this patch gives a feeling of overengineering for a less-popular (experimental) arch. Adding code with unclear benefits... Is it measurable?
Replacing one instruction with two likely makes execution slower, so since we don't implement linker relaxation for AVR, I think we should not do the jmp/call rewriting either.
My change is measurable:
- a short jmp/call costs 1 fewer tick than a long jmp/call
- the nop costs 1 tick.
So rewriting a call gives neither an improvement nor a regression: the nop is always executed after the call returns, so there is no change in ticks or space.
But rewriting a jmp does improve execution time: the nop after the jump is never executed, so one tick is saved.
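The accounting above can be written out as a small sketch. It only encodes the relative costs stated in this comment (short form saves 1 tick, nop costs 1 tick); the function name `netTicksSaved` is hypothetical and not part of lld.

```cpp
// Illustrative only: relative tick costs as stated in the comment above.
constexpr int kShortFormSaving = 1; // short jmp/call is 1 tick cheaper than long
constexpr int kNopCost = 1;         // the padding nop costs 1 tick when executed

// Net ticks saved by rewriting a long jmp/call to the short form + nop.
// After a call returns, execution resumes at the nop, so the nop is paid for.
// After a jmp, control never reaches the nop, so it costs nothing.
constexpr int netTicksSaved(bool isCall) {
  return kShortFormSaving - (isCall ? kNopCost : 0);
}

static_assert(netTicksSaved(/*isCall=*/false) == 1, "jmp rewrite saves 1 tick");
static_assert(netTicksSaved(/*isCall=*/true) == 0, "call rewrite is neutral");
```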
Please refer to GNU ld:
https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=bfd/elf32-avr.c;h=702719136d09acbc8c98ec49ab8129d0f33fffa8;hb=HEAD#l2721
https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=bfd/elf32-avr.c;h=702719136d09acbc8c98ec49ab8129d0f33fffa8;hb=HEAD#l2738
GNU ld also replaces JMP with an RJMP+NOP pair, since that saves one CPU cycle: the NOP is never executed, it is just a padding word.
ping ...
- relaxing a long jump to short jump + nop does save 1 CPU cycle;
- GNU ld also does this optimization, see
https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=bfd/elf32-avr.c;h=702719136d09acbc8c98ec49ab8129d0f33fffa8;hb=HEAD#l2721
https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=bfd/elf32-avr.c;h=702719136d09acbc8c98ec49ab8129d0f33fffa8;hb=HEAD#l2738
Increasing the number of instructions, while it may improve performance for some processors (any number?), doesn't look right...
- On all devices, a long jump takes 4 bytes, and short jump + nop also takes 4 bytes, so the code neither grows nor shrinks.
- As the AVR instruction set manual indicates, a short jump costs 2 cycles, while a long jump costs 3 cycles. So one CPU cycle is saved.
For example,
    long jump _foo    ; this is an unconditional jump which costs 4 bytes and 3 CPU cycles
    ...
    short jump _foo   ; this is an unconditional jump which costs 2 bytes and 2 CPU cycles
    nop               ; this `nop` is just for padding the space of the replaced `long jump`, it is never executed.
In the comparison above, the nop is never executed, so short jump + nop does not waste any space but saves one CPU cycle.
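For reference, here is a minimal standalone sketch of the rewrite being discussed. It is not the code in this patch: the helper name `relaxLongToShort` and its parameters are hypothetical, and it assumes `loc` points at a 4-byte JMP/CALL stored as little-endian 16-bit words and that `targetMinusInsn` is the byte distance from the instruction to its target. It uses the standard AVR encodings (JMP `0x940C`, CALL `0x940E`, RJMP `0xC000 | k`, RCALL `0xD000 | k`, NOP `0x0000`), where the short forms take a signed 12-bit word offset relative to the next instruction.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, for illustration only (not lld's implementation).
// Rewrites a 4-byte JMP/CALL at `loc` into RJMP/RCALL + NOP when the target
// is in range. Returns true if the bytes were rewritten.
bool relaxLongToShort(uint8_t *loc, int64_t targetMinusInsn) {
  assert(loc != nullptr);
  uint16_t first = static_cast<uint16_t>(loc[0] | (loc[1] << 8));

  // JMP is 1001 010k kkkk 110k, CALL is 1001 010k kkkk 111k; mask out k bits.
  uint16_t op = first & 0xFE0E;
  bool isJmp = op == 0x940C;
  bool isCall = op == 0x940E;
  if (!isJmp && !isCall)
    return false;

  // RJMP/RCALL compute PC <- PC + k + 1 in words, so k = distance/2 - 1.
  if (targetMinusInsn % 2 != 0)
    return false;
  int64_t k = targetMinusInsn / 2 - 1;
  if (k < -2048 || k > 2047) // signed 12-bit word offset
    return false;

  uint16_t shortInsn = static_cast<uint16_t>(
      (isJmp ? 0xC000 : 0xD000) | (static_cast<uint16_t>(k) & 0x0FFF));
  loc[0] = static_cast<uint8_t>(shortInsn & 0xFF);
  loc[1] = static_cast<uint8_t>(shortInsn >> 8);
  loc[2] = 0x00; // NOP (0x0000) pads the second word of the old instruction
  loc[3] = 0x00;
  return true;
}
```

The total size stays at 4 bytes, so no addresses after the instruction move, which is why this does not require full linker relaxation.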
lld/ELF/Arch/AVR.cpp | ||
---|---|---|
258 | This NOP is just for padding; strictly speaking we do not need to write it and could leave the bytes unchanged. (However, that would make llvm-objdump show an <unknown>.) |
I can give up this patch. It is only a tiny optimization and has little effect on lld. But can my other patch, https://reviews.llvm.org/D147364, be reviewed and accepted?
R_AVR_LO8_LDI_GS/R_AVR_HI8_LDI_GS are still missing in lld. With them implemented, lld will be as fully functional as GNU ld and can finally replace it. My aim of having clang + llvm + compiler-rt + lld fully replace the GNU toolchain can then be achieved. I would really appreciate that!