32-bit shift instructions were previously expanded using the default SelectionDAG expander, which meant it used 16-bit constant shifts and ORed them together. This works, but is far from optimal.
The patch here uses a custom instruction inserter to insert optimized constant 32-bit shifts. This is done using three new pseudo-instructions that take the upper and lower bits of the value in two separate 16-bit registers and outputs two 16-bit registers.
This change results in around 31% less instructions on average for constant 32-bit shifts, and is in all cases equal or better than the old behavior. It also tends to match or outperform avr-gcc: the only cases where avr-gcc does better is when it uses a loop to shift, or when the LLVM register allocator inserts some unnecessary movs. But it even outperforms avr-gcc in some cases where avr-gcc does not use a loop.
As a side effect, non-constant 32-bit shifts also become more efficient.
For some real-world differences: the build of compiler-rt I use in TinyGo becomes 2.7% smaller and the build of picolibc I use becomes 0.9% smaller. I think picolibc is a better representation of real-world code, but even a ~1% reduction in code size is really significant.
I have tested this patch in simavr, with each of the 3 kinds of shifts and each of the 30 shift amounts that can be used.
Future patches can improve on this:
- A loop can often result in reduced code size at the expense of speed. We should use a loop when the minsize attribute is set and a loop would need fewer instructions.
- For some reason the register allocator inserts some unnecessary moves. I think this can be worked around by fiddling with REG_SEQUENCE, but I'm not sure. A better solution might be to try and split 16-bit instructions into 8-bit instructions before register allocation (something which I've tried a few times but haven't found a good solution for yet).
- Because the main algorithm is independent of the number of registers, the same code can be used to emit moves for other shift widths (8, 16, 64, and others if needed). But this can be done in later patches if needed.
I mostly based myself on an investigation I did a while ago: https://aykevl.nl/2021/02/avr-bitshift
LSLW is a bit confusing with LSLWN, so how about LSLWP, which means LSLW-Pair.