This is an archive of the discontinued LLVM Phabricator instance.

Add tail call optimization for thumb1-only targets rev. 3
Needs ReviewPublic

Authored by Langohr on Jan 15 2015, 12:22 PM.

Details

Reviewers
compnerd
jroelofs
Summary

For Tail calls identified during DAG generation, the target address will
be loaded into a register by use of the constant pool.
If R3 is used for argument passing, the target address is forced
to hard reg R12 in order to overcome limitations thumb1 register
allocator with respect to the upper registers.

During epilog generation, spill register restoring will be done within
the emit epilogue function. Three different cases are to be distinguished.

  1. If LR is not pushed on the stack. Then simply a BX is generated.
  2. If LR is pushed on the stack and R3 is available as scratch, LR is restored after pop { ... } for the remaining callee saved regs.
  3. If R3 is not available for LR restore, LR is restored before pop { ... } and the stack pointer is re-adjusted afterwards
  4. If all regs R0...R3 are used for function call parameters, The target address will be copied to R12.

For a cortex M0 I did count that the sequence 2) will take one cycle longer than a version based on BL / pop { ..., pc } without a tail call. Option 3 will be 2 cycles slower than a version without a tail call (additional SP+=4) and option 4 will be 3 cycles slower. Also, 2) and 3) and 4) will generate sligthly larger code (have a look at the test cases).

In discussions on llvm-dev some did argue that for this reason, tail call optimization should not be integrated as part of the default options. In my personal perception the spared precious stack memory is readily worth it.

Diff Detail

Event Timeline

Langohr updated this revision to Diff 18243.Jan 15 2015, 12:22 PM
Langohr retitled this revision from to Add tail call optimization for thumb1-only targets.
Langohr updated this object.
Langohr edited the test plan for this revision. (Show Details)
Langohr added a reviewer: jmolloy.
Langohr set the repository for this revision to rL LLVM.
Langohr added a subscriber: Unknown Object (MLST).

I think you should also add tests for tail-called functions that return things other than void (especially things wider than one register in size).

One more question: do these changes still work on armv4t when an arm function tail calls a thumb function (and vice versa)?

lib/Target/ARM/ARMISelLowering.cpp
1682

as mentioned in the other thread, these variable names feel quite long.

1688

I think the usual "llvm style" here is to put the &&'s at the end of the line, rather than at the beginning, and only break the line where it would go over 80 cols.

Might be worthwhile to clang-format the patch.

1712

unncessary whitespace

1713

ditto

lib/Target/ARM/Thumb1FrameLowering.cpp
420

would be better now to combine this line with the one below it.

568

Shorter name suggestion: MustRestoreR4.

580

This variable is never assigned to again.... Should pick one name for this, and stick with it, delete the other one.

587

A shorter name suggestion: LRRestoreReg.

596

Need a RegToUseForLRRestore = ARM::R4; here, otherwise it is used uninitialized later.

617

I think a better name for this would be "EmptyPop", with all of it's defs & uses inverted.

666

Get rid of the variable IsTailCallReturn and just check the condition here.

684

changes here do nothing... would be better to revert them. same with the added newlines.

test/CodeGen/ARM/twoParametersTailCall_v6m.ll
20

I think all of the attributes and debug information can be dropped. Most tests put the CHECK: lines inbetween the ret and the '}' at the end of the function.

jroelofs removed a subscriber: jroelofs.
Langohr updated this revision to Diff 18316.Jan 16 2015, 1:23 PM
Langohr retitled this revision from Add tail call optimization for thumb1-only targets to Add tail call optimization for thumb1-only targets rev. 3.
Langohr updated this object.

I think you should also add tests for tail-called functions that return things other than void (especially things
wider than one register in size).

OK. Is done.

One more question: do these changes still work on armv4t when an arm function tail calls a thumb function (and vice versa)?

Tail calls occuring in ARM code will not be affected by the change.
Tail calls occuring in THUMB code will use BX for the call so that from my understanding, the mode switch will happen according to bit #0. As much as I know, when resolving the jump target relocation for function entry point lables, bit #0 will be correctly set for thumb functions. I will double check that next week.

Langohr edited the test plan for this revision. (Show Details)Jan 16 2015, 2:13 PM

I've looked at the code of the register scavenger for thumb1.

There definitely *is* an issue. The register scavenger takes R12 without any further checks of usage.

We will need to implement some change in the scavenger to be on the safe side. Using the stack for scavenging will prove difficult, when considering possible alloca() uses.

Scavenging will never be necessary within the epilogue code, if the ldr rX, mov R12, rx sequence shows up just before the epilogue, we will not be having a problem. Therefore, One might look for a way for forcing the mov R12,rx to be the very last instruction before epilogue generation.

As minimum, I'd add checks for usage of R12 in the target scavenger in order to run into an assert instead of silently generating bad code.

The other option, I am seeing is to make the register scavenger use LR in case that LR is in the CSI list. LR will be pushed and poped as soon as any GPR is spilled and restored. In case of load address loading to R12 for tail calls, LR is always pushed. In this case, we may readily use LR instead of R12 for scavenging. The only issue that I may imagine for this approach is a possible use of __builtin_return_address that might try to get the address from LR and interfere with scavenging.

Langohr updated this revision to Diff 18345.Jan 17 2015, 4:55 AM
Langohr updated this object.
Langohr edited the test plan for this revision. (Show Details)
Langohr removed rL LLVM as the repository for this revision.

Added check for use of R12 in register scavenging. Try to fall-back to LR in case that R12 is used in the basic block where the emergency condition occurs.

When re-analyzing prologue and epilogue generation code for thumb1, it seems to me that there is another issue related to the fix done in r210889. The code in head still refers to DPR registers within a possibly available VFP unit and corresponding register spilling and restoring. For thumb1, DPR registers are, however, inaccessible. Prologue and epilogue, thus, should only consider AAPCS calling convention and not AAPCS-VFP. I doubt that this part of the code had shown up as part of copy-and-paste from thumb2/ARM code.?

Similarly, in thumb1 code, I find expressions like STI.isTargetMachO(). In my understanding MachO is only relevant for iOS systems. Is there really any iOS system requiring thumb1 CPUs?

jroelofs edited edge metadata.Jan 19 2015, 8:54 AM

I'd also like to see a test case where r7 is used for the scratch reg.

lib/Target/ARM/ARMISelLowering.cpp
1694

Control flow would be easier to read if you eliminate IsCallAddrRequiredInReg and just do the checks needed here.

test/CodeGen/ARM/twoParametersTailCall_v6m.ll
7

Now that you've dropped the metadata, you can also drop the #[0-9]'s

Langohr updated this revision to Diff 18402.Jan 19 2015, 1:50 PM
Langohr updated this object.
Langohr edited edge metadata.

Added test case for use of R7 for LR restore. (This happens if fold SP update into push-pop adds a push r7).

Added stack pointer update similar to v4t epilogue code.

Added code for using LR instead of R12 as scavenge register in case that R12 happens to be liveIn the basic block.

This looks good, but I don't feel comfortable signing off on it. @t.p.northover, any thoughts on this?

test/CodeGen/ARM/threeParametersTailCall_v6m.ll
28

I think there's only supposed to be one of these per file. Same for the triple.

Hello Jon,

This looks good, but I don't feel comfortable signing off on it. @t.p.northover, any thoughts on this?

Thank you for reviewing.

I agree. With a change like this, one should better be on the safe side and have more than one review.
I also would delay commiting until I have finished some regression tests on real-world hardware. This week, I'll not be finding the time to do it but I hopefully will do next week. As mentioned, I have test code that extensively uses continuation-style coding with many tail calls.
Also as a second requirement for a commit, I will have to provide fixes for the 4 LIT regression tests that will trigger false failures because of the changed epilogue patterns. I will remove the superflous two lines within this change.

compnerd edited edge metadata.Jan 22 2015, 7:08 PM

Id say merge all the test cases into a single file (thumb1-tail-call.ll ?).

lib/Target/ARM/ARMISelLowering.cpp
1677

Unnecessary braces

1683

Unnecessary whitespace before the last ). Why not collapse this to:

bool ForceCallAddrToIP = isTailCall && (RegsToPass.size() >= 4 && Subtarget->isThumb1Only());
1694

Unnecessary whitespace in condition.

1805

Unnecessary whitespace between getRegister and (.

lib/Target/ARM/Thumb1FrameLowering.cpp
335

Why not collapse this to:

bool IsTailCallReturn = MBBI->getOpcode() == ARM::TCRETURNri;
361

I don't think that the parenthesis are needed and they detract from readability. I would swap the order since the negation check would short-circuit.

403

This should be on the same line as the brace. Can we perhaps fold the knowledge that the fold should be avoided in tail calls into tryFoldSPUpdateIntoPushPop? At the very least, I think the condition should be:

else if (IsTailCallReturn || !tryFoldSPUpdateIntoPushPop(...))
413

Use a switch rather than the if cases?

428

Any reason to not use '='? That is the general style in LLVM.

441

Unnecessary braces.

447

Unnecessary space after the !.

451

Unnecessary space after assert. I would also drop the >= 0.

466

Space after the comma.

468

Unnecessary space before the ++. Actually, switch to the prefixed form as that is preferred in LLVM code style.

511

Perhaps const auto *RI would help reduce the wrapping?

527

Unnecessary braces

539

unnecessary space after assert.

542

The next line should fit on this line. Why the unnecessary dl variable declaration above? Its a single use, why not inline it here.

656

Unnecessary empty line.

660

Seems like this could be written as:

if (MBB.getLastNonDebugInstr()->getOpcode() == ARM::TCRETURNri)
  return true;
lib/Target/ARM/Thumb1RegisterInfo.cpp
425

The text feels weird. Reflow the text?

431

Unnecessary braces.

436

Is ScratchReg a better name perhaps?

When re-analyzing prologue and epilogue generation code for thumb1, it seems to me that there is another issue related to the fix done in r210889. The code in head still refers to DPR registers within a possibly available VFP unit and corresponding register spilling and restoring. For thumb1, DPR registers are, however, inaccessible. Prologue and epilogue, thus, should only consider AAPCS calling convention and not AAPCS-VFP. I doubt that this part of the code had shown up as part of copy-and-paste from thumb2/ARM code.?

It's required to support ARM/Thumb interworking when using SjLj exceptions.