Page MenuHomePhabricator

[MachineOutliner][ARM][RFC] Add Machine Outliner support for ARM
Needs ReviewPublic

Authored by yroux on Jan 22 2019, 6:38 AM.

Details

Summary

Here is a proposal for Machine Outliner support on ARM targets. It is based on
the AArch64 implementation, and almost up-to-date with it. This implementation
is not intended to be checked-in as it, but is made available for testing and
feedback.

What are the differencies with AArch64 implementation?

  • ARM backend contains a Constant Island pass which splits constant pools inside functions, this pass needs to be made after the Machine Outliner to avoid the distance increase between the islands and the constant usage inside outlined functions and the potential breakage.
  • Handling of ARM, Thumb2 and Thumb1 code generation.
  • Branch instructions can be conditional in ARM, only unconditional ones can be outlined as a basic block terminator.
  • PIC instructions (i.e PICADD, ...) contain labels, and can't be outlined since offset computing would be broken.
  • Only one commit from the AArch64 implementation is missing, it's the one which moves the stack instructions check logic into getOutliningCandidateInfo, I plan to replicate it here, but it can be done later.

What is the status?

  • ARM and Thumb2 are implemented and tested, LLVM+Clang was bootstrapped with the machine outliner turned on, regression testsuite passed with failures (clang binary code size is reduced by ~6% for ARM and ~3.5% for Thumb2)
  • Thumb1 is handled but not fully tested yet.

What is missing and needs to be done before check-in?

  • More testcases should be added in particular MIR ones to check stack fix-ups, ...
  • The patch can be splitted, I don't really see how to do it for the support itself, but at least the move of the Constant Island pass and the properties added to the outlined functions (such that this pass can be run) can be contributed as a prerequisite.
  • The up-comming comments should be addressed ;)

Diff Detail

Repository
rL LLVM

Event Timeline

yroux created this revision.Jan 22 2019, 6:38 AM

Thanks very much for working on this, those numbers look promising. I had a quick read over, and have a few comments. I haven't looked at the checking logic in detail yet, mostly just the CodeGen part.

lib/Target/ARM/ARMBaseInstrInfo.cpp
5298

I think we should probably try to do something about these bare numbers as costs. Especially as things are pretty complicated with ARM/Thumb2/Thumb1. It's sometimes not quite clear just what cost we're talking about.

Maybe a special struct type with named costs that can be referred to when needed?

struct OutlineCosts {
  int TailCall;
  int Call;
  int SaveLRToStack;
  [...]
};
5722

I think it's more idiomatic to use BuildMI with an MBB and an iterator rather than separate build & insert steps.

5755

Comment doesn't quite fit here, and code should maybe be factored into a helper function.

5893

This probably ought to be a copyPhysReg, if you look in Thumb1InstrInfo are some really weird edge cases on old CPUs. As a bonus you don't even need to care about ARM vs Thumb if you do that.

5921–5923

I think we need to talk about stack alignment here. This sequence misaligns the stack on ARM and Thumb2.

If you take the view that the call to OUTLINED_FUNCTION is not a public interface (plausible, but by no means a definite) then the corresponding push for a nested call in buildOutlinedFrame will realign the stack except on Apple's watchOS ABI (armv7k, which has a 16-byte requirement).

If it is public, then this definitely has to change.

Either way, it should definitely be noted in comments.

5966

Indentation.

5967–5970

Haven't we just popped what originally came from r7? And If there's no frame pointer couldn't r7 be a normal register that has changed inside the outlined function?

Also, I think BuildMIs that insert at the same time would definitely be clearer in this function. In outline:

 // Do all kinds of save
 if (Thumb)
    BuildMI(...);
 else
    BuildMI(...);

 // Insert call
BuildMI(....);

 // Do all kinds of restore
 if (Thumb)
   BuildMI(...);
 else
   BuildMI(...);

I'd be pretty tempted to add virtual helper functions actually: spillLRToSTack and restoreLRFromStack.

Thanks for looking into this!

I added some comments.

I think that the biggest thing that stands out to me is that supporting Thumb means that some of the target hooks get extremely long. It'd be nice to split those up into two cases if possible.

lib/Target/ARM/ARMBaseInstrInfo.cpp
5428

Capitalization

5704–5706

Don't need brackets here.

5844–5846

This function is pretty long. I think it'd be a bit easier to follow if it was split into the Thumb and non-Thumb cases if possible.

Maybe something like a insertOutliner[WhateverInstructionWe'reAdding]Inst which returns the next value of It would work? E.g

if (C.CallConstructionID == MachineOutlinerTailCall) {
    It = insertOutlinerTailCallInst(Thumb);
    return It;
}

if (C.CallConstructionID == MachineOutlinerNoLRSave ||
      C.CallConstructionID == MachineOutlinerThunk) {
    It  = insertOutlinerCallInst(Thumb);
    return It;
}

...
lib/Target/ARM/ARMTargetMachine.cpp
242

Hmmm, this is somewhat bold. I (conservatively) think that it'd be good to split turning it on by default into a separate patch.

515

Is this always necessary, or only when the outliner is added? If it's not always necessary, would it be possible to only add it if the outliner was added?

efriedma added a subscriber: efriedma.
efriedma added inline comments.
lib/Target/ARM/ARMBaseInstrInfo.cpp
5127

You probably want to describe the overhead in bytes, not instructions, since they're substantially different in Thumb mode.

5302

supportsTailCall is more general than you need for MachineOutlinerTailCall; on Thumb1, you can outline a tail call if LR isn't live-in (if the terminator is a tPOP_RET etc.).

5310

supportsTailCall is a bit more general than you need for MachineOutlinerThunk; rewriting tBLX to tBX is legal in Thumb1. Also, missing check for tBLX (both here and elsewhere).

5528

This list isn't complete for Thumb; missing tLDRpci_pic, t2LDRpci_pic, t2MOVi16_ga_pcrel, t2MOVTi16_ga_pcrel.

5722

There are two different Thumb direct tail-call opcodes: tTAILJMPdND and tTAILJMPd; the latter is used for MachO targets because the assembler doesn't support relaxing a Thumb1 jump instruction.

5828

Not sure why you aren't using tPOP_RET for Thumb2.

lib/Target/ARM/ARMTargetMachine.cpp
515

createARMConstantIslandPass always has to run at some point. Running it after the outliner allows outlining constant pool references (which probably helps codesize, although it could increase codesize in some cases).

Looking at the passes that run between addPreEmitPass and addPreEmitPass2, moving it makes sense. I think the only other affected pass is FuncletLayout, and it probably wants to run before ConstantIslands anyway (ARM Windows EH currently isn't implemented, but might be in the future).

yroux marked 17 inline comments as done.Jan 23 2019, 7:10 AM

Thanks Tim, Jessica and Eli for the quick review!

I agree with the overall comment that mixing ARM and Thumb makes things long and hard to follow I'll split some parts and add some helpers as suggested.

lib/Target/ARM/ARMBaseInstrInfo.cpp
5127

It'd make more sense, indeed.

5298

I agree, I will give it a try.

5302

Indeed, I don't even understand why SupportsTailCall is defined as !isThumb() || hasV8MBaselineOps();

5310

Oh, yes!

5428

OK

5528

OK, I'll add these.

5704–5706

OK

5722

Tim, my intention is to stay close to AArch64 machine outliner code, such that improvements can be easily added to both targets, what about keeping it like that in a first step and then switch both targets to the idiomatic form ?

5722

OK, I'll add the MachO version

5755

hmm, maybe the comment can be more precise like "Create and Insert a save before outlined region." but I find it coherent with the code. Anyway, I agree with the overall comment that these functions are too long and a helper function would make it clearer.

5828

I'm not sure myself, I did it a while ago and need to re-think about it.

5844–5846

My first implementation for ARM mode only was made inside ARMInstrInfo.cpp, then I moved everything here to avoid duplication between ARM and Thumb, but I agree that splitting it a bit and adding some helpers as suggested by Tim, will make it easier to follow.

5893

OK will do.

5921–5923

Correct me Jessica if I'm wrong, my understanding of outlined functions is that they are not public interfaces and don't even respect the ABI but it is not an issue since we have the control on all the call sites, but I didn't thought of such cases of stack misalignment and targets requirement. I need to think more about it, but I agree at least a comment is needed.

5966

OK

5967–5970

Ah yes, as I said in the intro Thumb1 wasn't really tested and this might be a remaining of my experiments, I'll re-design this part as suggested, and pick-up a register which is not used inside the outlined region when there is no frame pointer.

lib/Target/ARM/ARMTargetMachine.cpp
242

Oh yes, that was just used for testing, turning it on by default and bundle it with -Oz will come later.

efriedma added inline comments.Jan 23 2019, 1:05 PM
lib/Target/ARM/ARMBaseInstrInfo.cpp
5302

supportsTailCall is really about whether isel supports generating a tail call, not what the underlying instruction set supports. See also https://reviews.llvm.org/D49465 .

hasV8MBaselineOps() is the target feature that determines whether t2B is supported, in case that wasn't obvious.

t.p.northover added inline comments.Jan 24 2019, 1:57 AM
lib/Target/ARM/ARMBaseInstrInfo.cpp
5921–5923

That's a good point about the PCS, so it's almost certainly not morally a public interface.

We sort of still have to worry about what the linker might insert between the branch to the outlined function and actually getting there, but since this is all within a single linkage unit I think the worst would be a simple branch island, unlikely to rely on stack alignment. And if it did, it might rely on far more that we break.

We do still need to guarantee 16-byte alignment for nested calls on armv7k though.

Thanks for working on this! I ran some tests and it's looking great (except for a couple of outliers).

lib/Target/ARM/ARMBaseInstrInfo.cpp
5431

What if we are actually storing sp? As in something like,
t2STRi8 $sp, $r7, 12, ...

yroux marked an inline comment as done.Jan 28 2019, 8:06 AM

Thanks for the review David,

what kind of outliers have you seen ? So far the only cases of code size being increase with outlining is due to some padding added to align the oulined functions.

lib/Target/ARM/ARMBaseInstrInfo.cpp
5431

hmm, there might be an issue if the stored value (with a fix-up) is used after the outlined code indeed, I think that this kind of issue didn't occurred in my tests because of the constraints made at lines 5645-5697, I'll try to exhibit the issue with an mir test.

The only increase I was seeing was because of D57254 (and perhaps the alignment thing you mention, but that was very small/rare, iirc). The main part was definitely the exception handlers pulling in a big chunk more code. The other results were looking great.

lib/Target/ARM/ARMBaseInstrInfo.cpp
5431

I think it was also trying to access operand Idx (= 1) as an Imm, not a Reg, which is what made me spot it. I can probably get a reproducer, if it would be useful.

t.p.northover added inline comments.Feb 8 2019, 2:57 AM
lib/Target/ARM/ARMBaseInstrInfo.cpp
5921–5923

I've done some more thinking about this and I believe it's actually an issue for everyone else too: Not all callsites will have the same LR-saving behaviour so the outlined function doesn't know how much it needs to adjust the stack by to preserve alignment.

So far I see three possible solutions:

  • Reserve LR so the issue never comes up.
  • Make all stack-based saves & restores preserve the proper alignment.
  • Add new outlining kinds for the different global cases and use compressed sequences where possible (mostly when the callsites are homogeneous).

I've tested the first two a little, and actually it seems that reserving LR generates smaller code! Smaller even than when the stack gets misaligned. The extra register pressure seems to be offset by the fact that outlining is significantly smaller.

Herald added a project: Restricted Project. · View Herald TranscriptFeb 8 2019, 2:57 AM
yroux marked an inline comment as done.Feb 13 2019, 11:44 PM
yroux added inline comments.
lib/Target/ARM/ARMBaseInstrInfo.cpp
5921–5923

Hi Tim,

I came to the same conclusion that alignment must be preserved (If some LDREXD , STREXD sequence can be reached from the outlined region for instance doubleword alignment is mandatory), I'm working on a save/restore solution which preserve the proper alignment.

I don't see the cases you described where not all callsites have the same LR-saving behaviour, do you have an example and/or a reproducer which exhibits that reserving LR generates smaller code ? (if it is too long to put it here, you can email it)

smeenai added a subscriber: smeenai.Apr 1 2019, 9:58 PM

Any updates here?