This is an archive of the discontinued LLVM Phabricator instance.

Fix return sequence on armv4 thumb
ClosedPublic

Authored by jroelofs on Jul 31 2014, 3:51 PM.

Details

Diff Detail

Event Timeline

jroelofs updated this revision to Diff 12088.Jul 31 2014, 3:51 PM
jroelofs retitled this revision from to Fix return sequence on armv4 thumb.
jroelofs updated this object.
jroelofs edited the test plan for this revision. (Show Details)
jroelofs added a reviewer: t.p.northover.
jroelofs set the repository for this revision to rL LLVM.
jroelofs added a project: deleted.
jroelofs added a subscriber: Unknown Object (MLST).

Hi Jon,

See my comments inline.

Also, please add some tests with ARMv4T and ARMv5T variants, checking for their specific returns, using FileCheck.

Thanks!
--renato

llvm/lib/Target/ARM/Thumb1FrameLowering.cpp
474 ↗(On Diff #12088)

Since you're already creating a new boolean, why not make it IsV4TReturn and set it inside the conditional below, so that your special lowering only checks for a flag and not a bunch of them.

475 ↗(On Diff #12088)

A comment here hinting at the block you just added below would be nice. Something like:

// ARMv4T requires BX, see below
495 ↗(On Diff #12088)

This is probably going to create two POPs. I think a better way would be to detect the LR in the loop above and change it from PC to R3 and just add the BX magic here. It should be safe to assume R3 is fine, but you can add a check if you want.

rengolin removed a subscriber: rengolin.
t.p.northover edited edge metadata.Aug 1 2014, 4:40 AM

Hi Jon,

I had a bit of a think about the liveness thing I mentioned. I think it's mostly OK, but there is one issue.

Cheers.

Tim.

llvm/lib/Target/ARM/Thumb1FrameLowering.cpp
502–503 ↗(On Diff #12088)

I think R3 is potentially reserved for a return value. E.g.

define i128 @foo() {
 ret i128 0
}

It's rather an edge-case (particularly on v4t, but then v4t is one big edge case). But we probably shouldn't break it anyway.

The safest thing to do would be use the register scavenger. And if the machinery is wired up to restoreCalleeSavedRegisters (fingers crossed) the easiest way to do *that* is to just create a virtual register (even though it is post regalloc, it's got special handling in PrologEpilogInserter).

I'll add some tests, and look into the register scavenger thing. I think I'm going to hold off on coalescing the POPs though until I'm convinced that it's even possible to do.

Thanks for the quick review comments!

Jon

llvm/lib/Target/ARM/Thumb1FrameLowering.cpp
474 ↗(On Diff #12088)

Ok.

475 ↗(On Diff #12088)

Ok.

495 ↗(On Diff #12088)

Yes, it will create two POPs. ISTR that POPs require their register lists to be in ascending order with no duplicates, so I think swapping PC for R3 only works if reglist contains at most {r0,r1,r2,pc}. Please correct me here if I'm mistaken.

If I use a virtual register as Tim suggests, I think I'm going to have a hard time detecting the few cases where it is possible to coalesce these two POPs. Any suggestions on how to make that happen?

502–503 ↗(On Diff #12088)

Ohhhh, good point. What about i256? That would eat up _all_ the lo registers, leaving us with no option to pop the link address. I guess for the ABI to be sane, that would have to go on the stack...

rengolin added inline comments.Aug 1 2014, 8:05 AM
llvm/lib/Target/ARM/Thumb1FrameLowering.cpp
495 ↗(On Diff #12088)

That's a good point. If you can use LR, than the problem is solved. If not, let's leave the two POPs for now and create a FIXME comment to that effect.

502–503 ↗(On Diff #12088)

The AAPCS only allows you to use R0~R3, everything else goes on the stack. The confusion from using R3 comes from ARM's own docs suggesting it's a reasonable way of doing it (even not coalescing the POPs), but it fails to mention that large return values could use it.

I had a read on the old APCS and couldn't find any mentions and to why R3 should be used as a return register (though it says R0~R3 can be used for returning values). But since we don't follow the APCS any more, even on ARMv4T, I think we should discard any reference to that document.

Any other register would have to be spilled in the caller, which is just not viable. Tim's suggestion would be the only reasonable way of doing it as a generic case.

Without thinking too much, we may get away with POPing it into LR again, and BX LR. I haven't covered all tail-return / recursive cases, but it would be a way of not needing the register scavenger.

I had a read on the old APCS and couldn't find any mentions and to why R3 should be used as a return register (though it says R0~R3 can be used for returning values). But since we don't follow the APCS any more, even on ARMv4T, I think we should discard any reference to that document.

iOS is unfortunately stuck with the APCS.

Cheers.

Tim.

rengolin edited edge metadata.Aug 1 2014, 8:27 AM

*gasp* *cough* *spit*... er... ok.

Anyway, AAPCS is more strict anyway, so whatever we do, must follow both.

jroelofs added inline comments.Aug 1 2014, 8:30 AM
llvm/lib/Target/ARM/Thumb1FrameLowering.cpp
495 ↗(On Diff #12088)

Sure, I'll add a FIXME.

502–503 ↗(On Diff #12088)

Unfortunately POP cannot write into LR in thumb1.

Does a register scavenger take care of spilling if it cannot find a free register?

jroelofs updated this revision to Diff 12129.Aug 1 2014, 5:38 PM
jroelofs edited edge metadata.

I found another sequence that will work for the case where we would be clobbering r3 of the return value.

pop {r4, r5, r6, r7}
mov ip, r3
pop {r3}
mov lr, r3
mov r3, ip
bx lr

This is slightly less horrible than all the alternatives I've been able to come up with... correctness first, then performance later.

With this patch, I've fixed both the varargs and non-varargs cases, and added a test for the latter. I still need to write one for the former, but it is exercising the same logic.

Jon,

You probably have thought of it, but what's wrong with just "pop"ing it manually?

pop {r4, r5, r6, r7}
ldr lr, sp
add sp, #4
bx lr
ldr lr, [sp]

I don't think there's any Thumb1 load into registers above r7.

Cheers.

Tim.

Of course. I keep forgetting Thumb1 is that restrictive, and the v5 manual is not particularly descriptive. :(

The generated code looks horrible, but I guess that's Thumb1's fault, I can't think of anything better. I also like the conditional between larger return values and v5T vs. v4T lowering.

I have no more comments, if Tim agrees, this patch looks good to go to me.

cheers,
--renato

jroelofs updated this revision to Diff 12167.Aug 4 2014, 10:39 AM

Add tests for varargs functions.
Fix stack pointer update (accidentally dropped this).

jroelofs updated this revision to Diff 12168.Aug 4 2014, 10:47 AM

Beef up the sp checks a little more.

rengolin accepted this revision.Aug 5 2014, 7:06 AM
rengolin edited edge metadata.

LGTM, thanks!

--renato

This revision is now accepted and ready to land.Aug 5 2014, 7:06 AM
jroelofs closed this revision.Aug 5 2014, 10:22 AM

Committed in r214881. Thanks!

Jon