This is an archive of the discontinued LLVM Phabricator instance.

[AVR] Expand STDWSPQRr & STDSPQRr, approach #2

Authored by Patryk27 on Nov 25 2021, 1:12 PM.



This takes a different approach to solve instead of hacking around isFrameInstr(), I've moved parts of the logic to AVRMachineFunctionInfo which allowed me to move the registry-checking stuff directly into the AVRExpandPseudo pass, simplifying rest of the code; I think it's the cleaner approach for the issue posted on that MR.

This is my first merge request to LLVM, so probably the quality is subpar, but I'm ofc. open to any comments; thanks! :-)

Diff Detail

Event Timeline

Patryk27 created this revision.Nov 25 2021, 1:12 PM
Patryk27 requested review of this revision.Nov 25 2021, 1:12 PM
Patryk27 edited the summary of this revision. (Show Details)Nov 25 2021, 1:12 PM
Patryk27 added inline comments.Nov 25 2021, 1:15 PM

just saw there's MBBI.getParent()->getParent() (or something like that) that returns associated function, too

84 ↗(On Diff #389870)

passing MachineFunction "back in here" feels like cheating, but I can't see another way at the moment

Patryk27 added inline comments.Nov 25 2021, 1:22 PM

also, this ofc. needs to be refactored to use the EXPAND macro below

Patryk27 marked 2 inline comments as not done.Nov 25 2021, 1:24 PM
Patryk27 updated this revision to Diff 390220.Nov 28 2021, 11:23 AM
dylanmckay requested changes to this revision.Nov 29 2021, 4:02 AM

Hey @Patryk27, very nice patch, I would not have guessed it was your first LLVM PR.

I've left a few style comments. I need to review it a bit deeper for correctness which I plan to do during the week sometime.

Also, I noticed this patch uses the same-ish base commit from master as (from 6 months or so ago), which means it doesn't actually cleanly apply on LLVM master and has a merge conflict in AVRFrameLowering.cpp and so it needs a rebase on LLVM main branch

Cheers for the patch @Patryk27


Recommend you split the logic STI.getFrameLowering()->hasReservedCallFrame(MF) ? AVR::R29R28 : AVR::R31R30 into a method, with some reasonable name like `getStackStoreBaseRegister


The Block type is actually MachineBasicBlock (I would link you to the source code, but GitHub is having an outage at the moment ....) suffice to say it the typedef at line 47 of this file

MI.getParent() is a machine basic block hidden by the typedef (always the parent of current element of MBBI)

So MI.getParent() == MBB

And MI.getParent()->getParent() == MBB.getParent()

The suggestion is to use MBB.getParent() rather than MBBI.getParent()->getPArent()


Minor: Move this above next to STDWPtrQRr so similar operations are similarly ordered

1 ↗(On Diff #390220)

This test is good but it would be good to also have a mir test as well (like the other pseudo instruction tests in the same folder) to make sure that the instructions STDSPQRr and STDWSPQRr are always covered. Technically, the current test could become a false positive in the future if the instruction selection layer changes and the test passes bypassing the instruction you meant to test. Pretty unlikely in this situation though. But it would still be good to have it just to make sure it is covered with an MIR test too.

6 ↗(On Diff #390220)

The '6th sub-basic block' seems liable to change, I think it should be changed to a regex so that it doesn't break if LLVM common changes the generated basic block structure or order as does sometimes happen (see )

What might be a better idea: add an LLVM IR inline ASM statement with a NOP or something at the top of the function body. The generated ASM must have this inline ASM in its output, and it must also be after the function epilogue, therefore you could do CHECK: std Y+13, r18 and then CHECK-NEXT: std Y+14, r19 and then CHECK-NEXT: nop

But on the other hand, is the basic block CHECK actually useful? Perhaps we only need the two instruction checks.

Suggestion is: either make the basic block CHECK-LABEL use regex. or remove it. Either works

This revision now requires changes to proceed.Nov 29 2021, 4:02 AM
dylanmckay added inline comments.Nov 29 2021, 4:03 AM

I thought I deleted this comment as I changed my mind. Please ignore - there is no need to split it into a method

Patryk27 added inline comments.Nov 29 2021, 4:43 AM
1 ↗(On Diff #390220)

Technically, the current test could become a false positive in the future if the instruction selection layer changes and the test passes bypassing the instruction you meant to test.

True, true - how about I move this file to llvm/test/CodeGen/AVR/bug-$issuenumber.ll (not to lose the context of "this used to be an edge case"), and then create two separate MIR tests for those opcodes?

dylanmckay added inline comments.Nov 30 2021, 5:25 PM
1 ↗(On Diff #390220)

Good idea, yup

Patryk27 updated this revision to Diff 391090.EditedDec 1 2021, 11:12 AM
Patryk27 marked 3 inline comments as done.

Thanks for the review! Seizing the day, I've moved all of the regression tests to a common directory called regression - I think it's a bit tidier this way 🙂

Patryk27 added inline comments.Dec 1 2021, 11:17 AM

Besides adding an early exit a few lines above, this logic was not changed - Phabricator seems to have a hard time figuring out whitespaces in the diff


Shouldn't those two also have implicit-dif $sp? It looks like the codegen drops it somewhere.

Rahix added a subscriber: Rahix.Dec 2 2021, 12:20 PM
Patryk27 updated this revision to Diff 391843.Dec 4 2021, 9:40 AM
jwagen added a subscriber: jwagen.Dec 6 2021, 1:49 AM

Thanks for your work on this @Patryk27.

I've tested the change set by cherry picking onto the Rust fork and compiling a suite of example programs. All of my verification has passed without error.

I will continue to try to come up with a test case that exercises the potential issue I raised inline, but please do let me know if I've missed something there. Thanks again!


Perhaps I'm missing something, but where is Z being initialized? I'm happy with the use of Y (I'm fine trusting the comment in the old code that Y is guaranteed to hold a copy of SP, though maybe I shouldn't just trust that).

The old code initialized Z immediately prior to converting the pseudo stores into real stores. I can't see the same guarantee being made here. Do we know that Z still holds the adjusted frame pointer value? At first glance it seems like intervening instructions could easily clobber it.

Patryk27 added inline comments.Dec 12 2021, 9:12 AM

That's a great question, and the answer is: I'm not sure 😅

On a parallel matter, I wonder why we're using the Z in the first place - couldn't we always refer to R29R28? Seems like that's what the other backends (e.g. RISC-V) do.

dylanmckay added inline comments.Dec 15 2021, 4:27 AM

note to self (and others if they care)

considerations I haven't verified yet:

  • AVRFrameLowering, the original location of this code, is earlier on in the codegen process than AVRExpandPseudoInsts. Before this patch, the allocation of Z and the replacements of the stack pointer with it were isolated to the "frame setup" pass(es) of LLVM. After this patch, the allocation of Z stays in the frame setup pass, but the substitutions of it are moved to a later AVRExpandPseudos pass. It's conceivable that other passes may clobber the valid ranges of that the Z instruction contains the frame pointer and inserting another usage of it later in the process will actually read an invalid valid. I haven't verified this yet

It is safe to assume that the existing code is correctly initializing Y as the frame pointer, it does so here:

The main complication around the frame pointer in the existing backend which is what I was worried about (not anymore), is some special hax in the somewhere in the backend that detects when the register allocator has allocated the Y register already but then subsequently figures out the function needs stack spills, which then forbids the previously valid allocation of the now-FP Y (the backend fixes this clash somehow). But this patch doesn't touch any of that so no worries there.

The old code initialized Z immediately prior to converting the pseudo stores into real stores. I can't see the same guarantee being made here. Do we know that Z still holds the adjusted frame pointer value? At first glance it seems like intervening instructions could easily clobber it.

Yes, I agree.

Obvious prerequisite context: at function entry, the stack pointer register (SP) is the frame pointer. The stack pointer never disappears.

If the function has no variable stack allocations (= a fixed stack frame size), then you don't actually need to have a frame pointer register to operate on the stack because you know the size of the stack (therefore the layout of the stack too) so you if you wanted to read/write to the frame within the function body, you could achieve this without a dedicated frame pointer register by directly reading/writing from SP into a temp reg, adjusting the pointer as needed as you know the stack layout, and executing the memory operation). If you were to write the assembly this way, you get the benefit of having 33.333..% extra pointer variables available to your code, because you never reserved Z(R28R29) for the frame pointer. The downside however is that every stack operation inside a function without a dedicated FP is now maybe 3-4 instructions rather than 1, etc.

The AVR backend does this in many cases. All functions have access to SP register no matter what, obvious. Some functions have access to a dedicated frame pointer register
The AVR backend will dedicate Y(R28R29) as the dedicated frame pointer, and initialize it in function prologue, if and only if a dedicated frame pointer is "needed". Currently, a dedicated frame pointer is needed if these conditions are satisfied.

So, if there are no spills, no allocas, and no arguments passed on the stack: then you won't have a dedicated frame pointer and if you wish to access the stack you'll need to manually store/save SP into a temp register of your choosing and do the operation that way. This is what the old code you're replacing was doing here. At the very start and very end of the block, Z(R31R30) gets manually assigned as a temporary "frame pointer" register, for the duration of the function epilogue basic block that is passed into fixStackStores at the end.


So, the Z register can be considered a valid "temporary" frame pointer register and be used as such as you are doing. The key point though is that this substitution can only be done if: the function has no dedicated frame pointer (because only then will AVRFrameLowering.cpp store the stack pointer in Z register contain the frame pointer (that's still true after your patch) AND if you know for sure that the stack store/load instruction you're expanding originates from and only from the AVRFrameLowering.cpp if (Opcode == TII.getCallFrameSetupOpcode()) conditional that is actually setting Z.

Perhaps we can add some assertion to this function that ensures that you're only doing the Z substitution if you know the previous criteria I mentioned are valid (no dedicated FP + the instruction must have come from the if (Opcode == TII.getCallFrameSetupOpcode()) { conditional below. There might be a better way to do it tho, I'm still thinking

dylanmckay added inline comments.Dec 15 2021, 4:31 AM

because you never reserved Z(R28R29)

typo, should be:

because you never reserved Y(R28R29)

1 ↗(On Diff #391843)

I like the idea of moving the "bug coverage" tests to a subdir.

What do you reckon about renaming the directory past-bugs or something else rather than regression. The latter feels to generic - all of the tests, even outside the AVR directory, are there for regression so the name feels a bit meaningless.

Patryk27 added inline comments.Dec 15 2021, 11:32 AM

Hmm, I think that one thing still escapes me: in here - - why couldn't we just always operate on R29R28, forgetting about the Z register entirely? (assuming we expanded STDWSPQRr to always access Y, that is.)

I've been looking at the other backends and I can't for the life of me figure out why we need a second register whatsoever 🤔

1 ↗(On Diff #391843)

I've named it regression after AMDGPU's one, but sure - past-bugs sounds fine, too :-)

arsenm added inline comments.Dec 15 2021, 4:12 PM
1 ↗(On Diff #391843)

AMDGPU does not have regression subdirectory. Also I don't see any clear distinction between a regression test and any other

Patryk27 added inline comments.Dec 15 2021, 11:08 PM
1 ↗(On Diff #391843)


I was under the impression that a "regression test" meant a "test for a past bug" (under which condition e.g. our or.ll wouldn't be technically a regression test, but a typical unit / integration one). But yeah, seems like this phrase has a more general meaning that I thought.

Patryk27 added inline comments.Dec 15 2021, 11:09 PM
1 ↗(On Diff #391843)

than I thought*

cbgbt added a subscriber: cbgbt.Dec 16 2021, 12:31 AM
Patryk27 added a comment.EditedDec 17 2021, 11:52 AM

Edit: this is a wrong lead; please see:

So I think something this patch does is not right - I'm not sure how to prepare a minimized, llvm-only example, but if you try to run this code:


use atmega_hal::{pins, usart::Usart0, Peripherals};
use core::hint::black_box;
use panic_halt as _;
use ufmt::*;
use void::{ResultVoidExt, Void};

pub type Clock = atmega_hal::clock::MHz20;

fn main() -> ! {
    let dp = Peripherals::take().unwrap();
    let pins = pins!(dp);

    let mut serial =
        Usart0::<Clock>::new(dp.USART0, pins.pd0, pins.pd1.into_output(), 115200.into());

    fun(&mut serial, 23, 23);
    fun(&mut serial, 24, 24);
    fun(&mut serial, 25, 25); // commenting-out this line makes the code work

    loop {}

fn fun(serial: &mut dyn uWrite<Error = Void>, x: u32, y: u32) {
    let _ = black_box(serial);
    let _ = black_box(x / y);
atmega-hal = { git = "", features = ["atmega328p", "rt"] }
embedded-hal = "0.2"
panic-halt = "0.2"
ufmt = "0.1"
void = { version = "1", default-features = false }

panic = "abort"
codegen-units = 1
opt-level = 1
lto = false

... you'll see that the 25 / 25 division triggers what seems to be a CPU reset, and in reality is caused by a ret on a corrupted stack frame.

I've managed to nail the problem down to a miscompilation in __udivmodsi4 (, as if the ldd instructions before st / std were removed (?), but I'm not sure how to investigate it further 😅

Patryk27 added a comment.EditedDec 18 2021, 3:32 AM

I've got some progress! (that's division as compiled from; sorry for pastebin, but I wanted to avoid spamming this thread with even more code)

llc-ing that code with -march=avr returns what seems to be valid code, but with -march=avr -mcpu=atmega328p the st / std look shady:

.LBB0_60:                               ; %_ZN17compiler_builtins3int19specialized_div_rem11u32_div_rem17h69a9140ee9b21fa0E.exit.i
        ldi     r24, 0
        cpi     r30, 0
        cpc     r31, r24
        breq    .LBB0_62
; %bb.61:                               ; %bb2.i
        st      Z, r12
        std     Z+1, r13
        std     Z+2, r6
        std     Z+3, r7

The stack-corruption flow goes: ldi -> cpi -> cpc -> breq (not taken) -> st (and here we refer to invalid Z (?)).

Edit 1:
On a second thought, that very part seems to be alright - it refers to:

if let Some(rem) = rem {
    *rem = quo_rem.1;

So the stack corruption must be caused by r30 & r31 containing trashy values, not the contents of the rem parameter, as the codegen expects. With -opt-bisect-limit=0 the output looks "more correct" (although I haven't executed it yet):

.LBB0_64:                               ; %_ZN17compiler_builtins3int19specialized_div_rem11u32_div_rem17h69a9140ee9b21fa0E.exit.i
	ldi	r24, 0
	cpi	r16, 0
	cpc	r17, r24
	breq	.LBB0_66
	rjmp	.LBB0_65
.LBB0_65:                               ; %bb2.i
	movw	r30, r16
	st	Z, r8
	std	Z+1, r9
	movw	r30, r16
	std	Z+2, r6
	std	Z+3, r7
	rjmp	.LBB0_66

Edit 2:
Seems like the peephole optimizer overwrites usages of r16:r17 with r30:r31, but it doesn't do it everywhere (?), causing some parts of the code to refer to r30:r31 thinking it contains rem, while in reality it doesn't.

Edit 3:
Stepping-through the code with GDB proves that the optimization is legal; what's curious though is that it looks like the rem argument gets scrambled even before the function is ran!

It looks like the numbers passed in here:

fun(&mut serial, 25, 25);

... get mistaken for the rem parameter! So it's as if performing 25 / 25 tried to do:

*((*uint16_t) 0x19) = something;

... which just happens to overwrite the stack; for instance 30 / 30 works, since it happens to refer to an unused memory region.

Edit 4:
Hmm, when expanding LLVM's div, shouldn't we pass 0 as the third argument, to zero-out rem? Seems like we're not doing that, which means that if r16:r17 happen to contain a non-zero value...

Edit 5:
I guess expansions for division & remainder were designed with avr-gcc's runtime in mind, right ( That'd explain the missing zeroing-out :-)

Edit 6:
🤦 rustc now ignores the no-compiler-rt field in target.json, which means that AVR builds are being linked with compiler-rt instead of the avr-gcc's runtime; well, what a journey - from a possible bug in the codegen straight to "nah, it's a linker issue" 😅


jonnor added a subscriber: jonnor.Jan 12 2022, 2:34 AM
Patryk27 added a comment.EditedJan 18 2022, 2:50 AM

I've just found a small bug with our current approach: if we generate STDWPtrQRr from within AVRExpandPseudo, then we don't have a chance to run AVRRelaxMem on that newly-created instruction (so, bottom line, functions with larger stacks trigger assert(Imm <= 62 && "Offset is out of range");).

Maybe we should merge AVRRelaxMem with AVRExpandPseudo or we should have a separate pass for STDWPtrQRr?

Patryk27 updated this revision to Diff 402203.Jan 22 2022, 4:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 22 2022, 4:45 AM
Patryk27 added inline comments.Jan 22 2022, 4:47 AM

I'm not sure if that's correct, but the previous approach:

    .addDef(Ptr.getReg(), getKillRegState(Ptr.isKill()));

... wasn't correct either, since it's not possible to addDef() a killed register (this crashes LLVM, but there was no test to check it on our side).

Patryk27 marked an inline comment as not done.Jan 22 2022, 4:48 AM
Patryk27 updated this revision to Diff 402213.Jan 22 2022, 6:49 AM
myhsu added a reviewer: myhsu.Mar 15 2022, 3:30 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 15 2022, 3:30 PM

It would be better to write a clear commit message with clear title.

[AVR] Expand pseudo instructions STDWSPQRr & STDSPQRr

1. What is the issue / Why there is a failure in instruction selection ?
2. How you fix it.

This is just my suggestion, a unique clear commit message is necessary when committing to the main branch, nobody would like to track links to learn all about your work.

I find some test files locations are moved, I suggested you create a seperate patch for that, and this patch only focuses on fixing the instruction selection failure.

Also I can help review and commit the other patch.

benshi001 added inline comments.Mar 19 2022, 5:00 AM

I suggest we make another patch for merge AVRRelaxMem::relax<AVR::STDWPtrQRr> into expand<AVR::STDWPtrQRr>, for the case Imm >= 63. And we select that merge patch as baseline / parent of current patch.

benshi001 added inline comments.Mar 19 2022, 6:19 AM

Why we need an extra Dst local variable here? I did not find there is any more use besides .getReg() .

benshi001 added inline comments.Mar 19 2022, 6:32 AM

Also, I did not find any other use of Src

benshi001 added inline comments.Mar 19 2022, 7:10 AM
        .addReg(DstReg, RegState::Define)
        .addReg(DstReg, RegState::Kill)

The original DstReg is indeed killed.

benshi001 added inline comments.Mar 19 2022, 7:47 AM
    .addReg(Dst.getReg(), RegState::Kill)
    .addReg(Src.getReg(), getKillRegState(SrcIsKill));

No matter DstIsKill is true or false, the new DstReg is always killed.


I think

if (!DstIsKill)
   buildMI(MBB, MBBI, AVR::POPWRd).addReg(DstReg, RegState::Define);
benshi001 added a comment.EditedMar 19 2022, 10:14 PM

I summary my suggestion in general, excluding some inline comments about coding details.

  1. Seperate the location movement of test files into a different patch A.
  2. Seperate the elimination of AVRRelaxMemPass / combination into expand<AVR::STDWPtrQRr> to a different patch B.
  1. I also concern current solution is a bit agresssive, so I would like to suggest a moderate way.

3.1 Kill the function fixStackStores and its orginal calls.
3.2 Mark the definition of STDWSPQRr/STDSPQRr with Defs = [R31R30] (Most ordinary STDWSPQRr/STDSPQRr will be substituted before regalloc, so that is fine.)
3.3 Still implement expand<AVR::STDSPQRr> and expand<AVR::STDWPQRr> as current patch does, but do not substitute to AVR::STDPtrQRr/AVR::STDWPtrQRr, we should do real instruction expansion with several buildMI calls. Since we have marked Defs = [R31R30] to STDWSPQRr/STDSPQRr, it is safe to expand to

in r30, 62
in r31, 63
subiw z, offset
st z, Rsrc


in r30, 62
in r31, 63
std z+offset, Rsrc
Patryk27 marked 6 inline comments as done.Mar 26 2022, 12:57 PM

Thanks for the review!

For the time being, I have extracted the first set of changes into:


Right, I think it must have been an oversight on my side; I have fixed it in the follow-up merge request (

Patryk27 abandoned this revision.May 9 2022, 1:25 AM
Patryk27 marked 2 inline comments as done.

Changes in this revision were split into separate revisions, so this one is not needed anymore.