This is an archive of the discontinued LLVM Phabricator instance.

[AVR] Optimize constant 32-bit shifts
AbandonedPublic

Authored by aykevl on Nov 22 2022, 3:50 PM.

Details

Summary

32-bit shift instructions were previously expanded using the default SelectionDAG expander, which meant it used 16-bit constant shifts and ORed them together. This works, but is far from optimal.

The patch here uses a custom instruction inserter to insert optimized constant 32-bit shifts. This is done using three new pseudo-instructions that take the upper and lower bits of the value in two separate 16-bit registers and outputs two 16-bit registers.

This change results in around 31% less instructions on average for constant 32-bit shifts, and is in all cases equal or better than the old behavior. It also tends to match or outperform avr-gcc: the only cases where avr-gcc does better is when it uses a loop to shift, or when the LLVM register allocator inserts some unnecessary movs. But it even outperforms avr-gcc in some cases where avr-gcc does not use a loop.

As a side effect, non-constant 32-bit shifts also become more efficient.

For some real-world differences: the build of compiler-rt I use in TinyGo becomes 2.7% smaller and the build of picolibc I use becomes 0.9% smaller. I think picolibc is a better representation of real-world code, but even a ~1% reduction in code size is really significant.

I have tested this patch in simavr, with each of the 3 kinds of shifts and each of the 30 shift amounts that can be used.

Future patches can improve on this:

  • A loop can often result in reduced code size at the expense of speed. We should use a loop when the minsize attribute is set and a loop would need fewer instructions.
  • For some reason the register allocator inserts some unnecessary moves. I think this can be worked around by fiddling with REG_SEQUENCE, but I'm not sure. A better solution might be to try and split 16-bit instructions into 8-bit instructions before register allocation (something which I've tried a few times but haven't found a good solution for yet).
  • Because the main algorithm is independent of the number of registers, the same code can be used to emit moves for other shift widths (8, 16, 64, and others if needed). But this can be done in later patches if needed.

I mostly based myself on an investigation I did a while ago: https://aykevl.nl/2021/02/avr-bitshift

Diff Detail

Event Timeline

aykevl created this revision.Nov 22 2022, 3:50 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 22 2022, 3:50 PM
Herald added subscribers: Jim, hiraditya. · View Herald Transcript
aykevl requested review of this revision.Nov 22 2022, 3:50 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 22 2022, 3:50 PM
aykevl edited the summary of this revision. (Show Details)Nov 22 2022, 3:53 PM
aykevl edited the summary of this revision. (Show Details)Nov 22 2022, 4:05 PM
aykevl updated this revision to Diff 477359.Nov 22 2022, 5:57 PM
  • fixed two tests that were failing as a result of this change
  • some minor other changes
aykevl updated this revision to Diff 478103.Nov 27 2022, 8:03 AM
  • rebase (no change in patch) - hopefully the pre-merge check now works?

I will have a look next week.

For your commit message, I think the front part is OK, and the other part from Future patches can improve on this: till the end needs not to be committed.

benshi001 added inline comments.Dec 4 2022, 11:40 PM
llvm/lib/Target/AVR/AVRISelLowering.cpp
286

This message should be Expected power-of-2 shift operand, you can also fix it along with your patch..

llvm/lib/Target/AVR/AVRISelLowering.h
42

LSLW is a bit confusing with LSLWN, so how about LSLWP, which means LSLW-Pair.

benshi001 added inline comments.Dec 5 2022, 2:30 AM
llvm/lib/Target/AVR/AVRISelLowering.cpp
1963

How about using GetZeroRegister() instead of fixed AVR::R1? This way should also fit avrtiny. Or we can emit eor Rx, Rx instead of mov Rx, Zero .

benshi001 added inline comments.Dec 5 2022, 2:33 AM
llvm/lib/Target/AVR/AVRISelLowering.cpp
1866

How about using GetZeroRegister() instead of fixed AVR::R1? This way should also fit avrtiny. Or we can emit eor Rx, Rx instead of mov Rx, Zero .

1919

How about using GetZeroRegister() instead of fixed AVR::R1? This way should also fit avrtiny. Or we can emit eor Rx, Rx instead of mov Rx, Zero .

1990

How about using GetZeroRegister() instead of fixed AVR::R1? This way should also fit avrtiny. Or we can emit eor Rx, Rx instead of mov Rx, Zero .

Is it possible to break this large patch to smaller ones, in which the first one is just the skeleton (with only shiftAmount = 1 implemented as the basic situation) ? Then later ones implment each special shiftAmount case .

benshi001 added inline comments.Dec 5 2022, 4:14 AM
llvm/lib/Target/AVR/AVRISelLowering.cpp
291
llvm_unreachable("Expected a constant shift amount!");
302

It is OK that ShiftAmount have a MVT::i8 type ? Could it be better to MVT::i16 ?

314

Would it better to use std::map or equivalent but more efficient llvm utilities?

aykevl updated this revision to Diff 480081.Dec 5 2022, 5:54 AM
aykevl set the repository for this revision to rG LLVM Github Monorepo.
  • use getZeroRegister instead of AVR::R1
  • use AVR::sub_lo and AVR::sub_hi instead of numeric constants
  • optimize REG_SEQUENCE
aykevl marked 4 inline comments as done.Dec 5 2022, 6:05 AM

Thank you for taking a look! I will update the patch later with your suggestions. For now, I've updated the patch a bit with changes I made locally in the past few days.
(I will also update the patch with extra context that I forgot to add this time).

llvm/lib/Target/AVR/AVRISelLowering.cpp
302

32-bit shift instructions will shift at most 31 bits, so i8 is good enough.

314

Do you know of any?
Other code seems to use a switch and I think this is very readable.

1866

This was already fixed in my local changes (this patch is older than D138582).

aykevl marked an inline comment as done.EditedDec 5 2022, 6:09 AM

Is it possible to break this large patch to smaller ones, in which the first one is just the skeleton (with only shiftAmount = 1 implemented as the basic situation) ? Then later ones implment each special shiftAmount case .

I didn't split the patch because this way, there is no code size regression. But if you prefer I can split the patch in multiple patches and commit them all at the same time once accepted.

EDIT: yeah I agree, splitting this patch is better. I'll do that in the next update.

benshi001 added inline comments.Dec 5 2022, 6:43 PM
llvm/lib/Target/AVR/AVRISelLowering.cpp
314

Though switch is more readable, it seems boring, in my opinion.

So I suggest

std::map<unsigned, unsigned> OpMap = {{ISD::SHL, AVRISD::LSLW},
                                      {ISD::SRL, AVRISD::LSRW}, 
                                      {ISD::SRA, AVRISD::ASRW}};
assert(OpMap.find(Op.getOpcode()) != OpMap.end() &&
       "Unexpected shift opcode");
unsigned Opc = OpMap[Op.getOpcode()];

which looks more clear.

benshi001 added inline comments.Dec 5 2022, 8:04 PM
llvm/lib/Target/AVR/AVRISelLowering.cpp
2146

Have your supplemented tests covered all those special conditions ?

benshi001 added inline comments.Dec 5 2022, 11:45 PM
llvm/test/CodeGen/AVR/shift32.ll
2

Also check for AVRTiny ?

benshi001 added inline comments.Dec 6 2022, 12:17 AM
llvm/lib/Target/AVR/AVRISelLowering.cpp
1843

I think using negative ShiftAmt for left shift is so counter-intuitive, it would be better to use one more argument bool leftShift, or direct expose the Opcode {shl, lshr, ashr}.

benshi001 added inline comments.Dec 6 2022, 12:24 AM
llvm/lib/Target/AVR/AVRISelLowering.cpp
1861

Renaming ShiftBytes to ShiftRegsSize looks better.

1906

Renaming ShiftBytes to ShiftRegsSize looks better.

benshi001 added inline comments.Dec 6 2022, 12:36 AM
llvm/lib/Target/AVR/AVRISelLowering.cpp
1855

Would it be better to split this large function insertMultibyteShift to small ones for each if statment ? for example, this if can be a standalone function insertMultibyteShiftMod6_7 (or any better name).

benshi001 added inline comments.Dec 7 2022, 2:54 AM
llvm/lib/Target/AVR/AVRISelLowering.cpp
1934

This if statement is not tested for ashr. It would be better to add an ashr6 or ashr30 test.

benshi001 added inline comments.Dec 7 2022, 2:58 AM
llvm/lib/Target/AVR/AVRISelLowering.cpp
1949

Ext and ExtMore seem confusing. May it be better that

Ext -> HighByte
ExtMore -> ExtByte

benshi001 added inline comments.Dec 7 2022, 3:07 AM
llvm/lib/Target/AVR/AVRISelLowering.cpp
2008

So here should be an assert((ShiftAmt > -8 && ShiftAmt < 8) && "Unexpect shift amount");

benshi001 added inline comments.Dec 7 2022, 3:41 AM
llvm/lib/Target/AVR/AVRISelLowering.cpp
2004

I am a bit confused about here. With Regs = Regs.slice(1, Regs.size() - 1);, so the 0 index element is dropped, and only 3 elements left in Regs ?

aykevl added a comment.EditedDec 22 2022, 11:30 AM

I created a series of patches to replace this one, for easier reviewing. Thank you for the thorough review!
See: D140569, D140570, D140571, D140572, D140573

llvm/lib/Target/AVR/AVRISelLowering.cpp
314

I'm not really convinced this is more readable. I like boring code, it means it is easy to understand.

1843

Hmm, you have a point here. But I kind of like using the signedness of the value.
What do you think of renaming it to ShiftRightAmt instead? Then it's clearer that a negative number will be a left shift.
I prefer to avoid opcodes because that would make the code less generic (I want to write a 64-bit shift later, for example).

1855

What would be the benefit of that?

1861

👍 sounds good

1934

It is, it is tested by ashr_i32_30. I have added a few more tests in my updated code.

1949

Fair enough. I'll update the code with these new names and some extra comments to explain what they do.

2004

Yes. I'll update the code to use drop_front and drop_back instead.

2008

👍 seems fine by me.

2146

Not sure, I did test all cases locally (all 93 cases) and this does not affect or improves code size in all cases. You can see a number of tests in D140573 that are optimized this way.
In any case, this is an optimization. As long as correct code is generated in both cases, the condition is purely a heuristic.

llvm/test/CodeGen/AVR/shift32.ll
2

The only difference is the lack of movw (which isn't emitted by this patch, instead is uses AVR::COPY which is lowered to either movw or two mov instructions depending on support). All other instructions are supported by avrtiny. So I do not think testing for AVRTiny is necessary here.

(I could add it of course if you think it's useful, but the test output will likely be near-identical).

You can close this patch.

aykevl abandoned this revision.Jan 6 2023, 12:21 PM

Closing, this has been replaced by multiple smaller patches.