This is an archive of the discontinued LLVM Phabricator instance.

[MachineOutliner][ARM][RFC] Add Machine Outliner support for ARM
AbandonedPublic

Authored by yroux on Jan 22 2019, 6:38 AM.

Details

Summary

Here is a proposal for Machine Outliner support on ARM targets. It is based on
the AArch64 implementation, and almost up-to-date with it. This implementation
is not intended to be checked-in as it, but is made available for testing and
feedback.

What are the differencies with AArch64 implementation?

  • ARM backend contains a Constant Island pass which splits constant pools inside functions, this pass needs to be made after the Machine Outliner to avoid the distance increase between the islands and the constant usage inside outlined functions and the potential breakage.
  • Handling of ARM, Thumb2 and Thumb1 code generation.
  • Branch instructions can be conditional in ARM, only unconditional ones can be outlined as a basic block terminator.
  • PIC instructions (i.e PICADD, ...) contain labels, and can't be outlined since offset computing would be broken.
  • Only one commit from the AArch64 implementation is missing, it's the one which moves the stack instructions check logic into getOutliningCandidateInfo, I plan to replicate it here, but it can be done later.

What is the status?

  • ARM and Thumb2 are implemented and tested, LLVM+Clang was bootstrapped with the machine outliner turned on, regression testsuite passed with failures (clang binary code size is reduced by ~6% for ARM and ~3.5% for Thumb2)
  • Thumb1 is handled but not fully tested yet.

What is missing and needs to be done before check-in?

  • More testcases should be added in particular MIR ones to check stack fix-ups, ...
  • The patch can be splitted, I don't really see how to do it for the support itself, but at least the move of the Constant Island pass and the properties added to the outlined functions (such that this pass can be run) can be contributed as a prerequisite.
  • The up-comming comments should be addressed ;)

Diff Detail

Repository
rL LLVM

Event Timeline

yroux created this revision.Jan 22 2019, 6:38 AM

Thanks very much for working on this, those numbers look promising. I had a quick read over, and have a few comments. I haven't looked at the checking logic in detail yet, mostly just the CodeGen part.

lib/Target/ARM/ARMBaseInstrInfo.cpp
5298

I think we should probably try to do something about these bare numbers as costs. Especially as things are pretty complicated with ARM/Thumb2/Thumb1. It's sometimes not quite clear just what cost we're talking about.

Maybe a special struct type with named costs that can be referred to when needed?

struct OutlineCosts {
  int TailCall;
  int Call;
  int SaveLRToStack;
  [...]
};
5722

I think it's more idiomatic to use BuildMI with an MBB and an iterator rather than separate build & insert steps.

5755

Comment doesn't quite fit here, and code should maybe be factored into a helper function.

5893

This probably ought to be a copyPhysReg, if you look in Thumb1InstrInfo are some really weird edge cases on old CPUs. As a bonus you don't even need to care about ARM vs Thumb if you do that.

5921–5923

I think we need to talk about stack alignment here. This sequence misaligns the stack on ARM and Thumb2.

If you take the view that the call to OUTLINED_FUNCTION is not a public interface (plausible, but by no means a definite) then the corresponding push for a nested call in buildOutlinedFrame will realign the stack except on Apple's watchOS ABI (armv7k, which has a 16-byte requirement).

If it is public, then this definitely has to change.

Either way, it should definitely be noted in comments.

5966

Indentation.

5967–5970

Haven't we just popped what originally came from r7? And If there's no frame pointer couldn't r7 be a normal register that has changed inside the outlined function?

Also, I think BuildMIs that insert at the same time would definitely be clearer in this function. In outline:

 // Do all kinds of save
 if (Thumb)
    BuildMI(...);
 else
    BuildMI(...);

 // Insert call
BuildMI(....);

 // Do all kinds of restore
 if (Thumb)
   BuildMI(...);
 else
   BuildMI(...);

I'd be pretty tempted to add virtual helper functions actually: spillLRToSTack and restoreLRFromStack.

Thanks for looking into this!

I added some comments.

I think that the biggest thing that stands out to me is that supporting Thumb means that some of the target hooks get extremely long. It'd be nice to split those up into two cases if possible.

lib/Target/ARM/ARMBaseInstrInfo.cpp
5428

Capitalization

5704–5706

Don't need brackets here.

5844–5846

This function is pretty long. I think it'd be a bit easier to follow if it was split into the Thumb and non-Thumb cases if possible.

Maybe something like a insertOutliner[WhateverInstructionWe'reAdding]Inst which returns the next value of It would work? E.g

if (C.CallConstructionID == MachineOutlinerTailCall) {
    It = insertOutlinerTailCallInst(Thumb);
    return It;
}

if (C.CallConstructionID == MachineOutlinerNoLRSave ||
      C.CallConstructionID == MachineOutlinerThunk) {
    It  = insertOutlinerCallInst(Thumb);
    return It;
}

...
lib/Target/ARM/ARMTargetMachine.cpp
242

Hmmm, this is somewhat bold. I (conservatively) think that it'd be good to split turning it on by default into a separate patch.

515

Is this always necessary, or only when the outliner is added? If it's not always necessary, would it be possible to only add it if the outliner was added?

efriedma added a subscriber: efriedma.
efriedma added inline comments.
lib/Target/ARM/ARMBaseInstrInfo.cpp
5127

You probably want to describe the overhead in bytes, not instructions, since they're substantially different in Thumb mode.

5302

supportsTailCall is more general than you need for MachineOutlinerTailCall; on Thumb1, you can outline a tail call if LR isn't live-in (if the terminator is a tPOP_RET etc.).

5310

supportsTailCall is a bit more general than you need for MachineOutlinerThunk; rewriting tBLX to tBX is legal in Thumb1. Also, missing check for tBLX (both here and elsewhere).

5528

This list isn't complete for Thumb; missing tLDRpci_pic, t2LDRpci_pic, t2MOVi16_ga_pcrel, t2MOVTi16_ga_pcrel.

5722

There are two different Thumb direct tail-call opcodes: tTAILJMPdND and tTAILJMPd; the latter is used for MachO targets because the assembler doesn't support relaxing a Thumb1 jump instruction.

5828

Not sure why you aren't using tPOP_RET for Thumb2.

lib/Target/ARM/ARMTargetMachine.cpp
515

createARMConstantIslandPass always has to run at some point. Running it after the outliner allows outlining constant pool references (which probably helps codesize, although it could increase codesize in some cases).

Looking at the passes that run between addPreEmitPass and addPreEmitPass2, moving it makes sense. I think the only other affected pass is FuncletLayout, and it probably wants to run before ConstantIslands anyway (ARM Windows EH currently isn't implemented, but might be in the future).

yroux marked 17 inline comments as done.Jan 23 2019, 7:10 AM

Thanks Tim, Jessica and Eli for the quick review!

I agree with the overall comment that mixing ARM and Thumb makes things long and hard to follow I'll split some parts and add some helpers as suggested.

lib/Target/ARM/ARMBaseInstrInfo.cpp
5127

It'd make more sense, indeed.

5298

I agree, I will give it a try.

5302

Indeed, I don't even understand why SupportsTailCall is defined as !isThumb() || hasV8MBaselineOps();

5310

Oh, yes!

5428

OK

5528

OK, I'll add these.

5704–5706

OK

5722

Tim, my intention is to stay close to AArch64 machine outliner code, such that improvements can be easily added to both targets, what about keeping it like that in a first step and then switch both targets to the idiomatic form ?

5722

OK, I'll add the MachO version

5755

hmm, maybe the comment can be more precise like "Create and Insert a save before outlined region." but I find it coherent with the code. Anyway, I agree with the overall comment that these functions are too long and a helper function would make it clearer.

5828

I'm not sure myself, I did it a while ago and need to re-think about it.

5844–5846

My first implementation for ARM mode only was made inside ARMInstrInfo.cpp, then I moved everything here to avoid duplication between ARM and Thumb, but I agree that splitting it a bit and adding some helpers as suggested by Tim, will make it easier to follow.

5893

OK will do.

5921–5923

Correct me Jessica if I'm wrong, my understanding of outlined functions is that they are not public interfaces and don't even respect the ABI but it is not an issue since we have the control on all the call sites, but I didn't thought of such cases of stack misalignment and targets requirement. I need to think more about it, but I agree at least a comment is needed.

5966

OK

5967–5970

Ah yes, as I said in the intro Thumb1 wasn't really tested and this might be a remaining of my experiments, I'll re-design this part as suggested, and pick-up a register which is not used inside the outlined region when there is no frame pointer.

lib/Target/ARM/ARMTargetMachine.cpp
242

Oh yes, that was just used for testing, turning it on by default and bundle it with -Oz will come later.

efriedma added inline comments.Jan 23 2019, 1:05 PM
lib/Target/ARM/ARMBaseInstrInfo.cpp
5302

supportsTailCall is really about whether isel supports generating a tail call, not what the underlying instruction set supports. See also https://reviews.llvm.org/D49465 .

hasV8MBaselineOps() is the target feature that determines whether t2B is supported, in case that wasn't obvious.

t.p.northover added inline comments.Jan 24 2019, 1:57 AM
lib/Target/ARM/ARMBaseInstrInfo.cpp
5921–5923

That's a good point about the PCS, so it's almost certainly not morally a public interface.

We sort of still have to worry about what the linker might insert between the branch to the outlined function and actually getting there, but since this is all within a single linkage unit I think the worst would be a simple branch island, unlikely to rely on stack alignment. And if it did, it might rely on far more that we break.

We do still need to guarantee 16-byte alignment for nested calls on armv7k though.

Thanks for working on this! I ran some tests and it's looking great (except for a couple of outliers).

lib/Target/ARM/ARMBaseInstrInfo.cpp
5431

What if we are actually storing sp? As in something like,
t2STRi8 $sp, $r7, 12, ...

yroux marked an inline comment as done.Jan 28 2019, 8:06 AM

Thanks for the review David,

what kind of outliers have you seen ? So far the only cases of code size being increase with outlining is due to some padding added to align the oulined functions.

lib/Target/ARM/ARMBaseInstrInfo.cpp
5431

hmm, there might be an issue if the stored value (with a fix-up) is used after the outlined code indeed, I think that this kind of issue didn't occurred in my tests because of the constraints made at lines 5645-5697, I'll try to exhibit the issue with an mir test.

The only increase I was seeing was because of D57254 (and perhaps the alignment thing you mention, but that was very small/rare, iirc). The main part was definitely the exception handlers pulling in a big chunk more code. The other results were looking great.

lib/Target/ARM/ARMBaseInstrInfo.cpp
5431

I think it was also trying to access operand Idx (= 1) as an Imm, not a Reg, which is what made me spot it. I can probably get a reproducer, if it would be useful.

t.p.northover added inline comments.Feb 8 2019, 2:57 AM
lib/Target/ARM/ARMBaseInstrInfo.cpp
5921–5923

I've done some more thinking about this and I believe it's actually an issue for everyone else too: Not all callsites will have the same LR-saving behaviour so the outlined function doesn't know how much it needs to adjust the stack by to preserve alignment.

So far I see three possible solutions:

  • Reserve LR so the issue never comes up.
  • Make all stack-based saves & restores preserve the proper alignment.
  • Add new outlining kinds for the different global cases and use compressed sequences where possible (mostly when the callsites are homogeneous).

I've tested the first two a little, and actually it seems that reserving LR generates smaller code! Smaller even than when the stack gets misaligned. The extra register pressure seems to be offset by the fact that outlining is significantly smaller.

Herald added a project: Restricted Project. · View Herald TranscriptFeb 8 2019, 2:57 AM
yroux marked an inline comment as done.Feb 13 2019, 11:44 PM
yroux added inline comments.
lib/Target/ARM/ARMBaseInstrInfo.cpp
5921–5923

Hi Tim,

I came to the same conclusion that alignment must be preserved (If some LDREXD , STREXD sequence can be reached from the outlined region for instance doubleword alignment is mandatory), I'm working on a save/restore solution which preserve the proper alignment.

I don't see the cases you described where not all callsites have the same LR-saving behaviour, do you have an example and/or a reproducer which exhibits that reserving LR generates smaller code ? (if it is too long to put it here, you can email it)

smeenai added a subscriber: smeenai.Apr 1 2019, 9:58 PM

Any updates here?

Are there plans to move this forward?

I have no plans to work on it. If someone wants to pick it up, I'd be happy to answer any questions.

monat added a subscriber: monat.Nov 20 2019, 8:46 AM
yroux marked an inline comment as not done.Nov 22 2019, 2:24 AM

Sorry, for the delay, I'm finalizing some testcases and will send an update version, hopefully next week

yroux updated this revision to Diff 236838.Jan 8 2020, 9:21 AM

Hi,

here is an update for the Machine Outliner support on ARM targets.
It addresses the comments received on the initial version and is rebased on current trunk.
Among the modifications needed by this rebase, this patch propose to move the ARM Low Overhead Loops pass before the Outliner and the Constant Island ones, because it requires the function to track the liveness (which is not the case of the outlined ones) and moving the Constant Island at the end seems more appropriate. If it is ok I'll split the patch and propose the move in a different review, but here is the whole thing such that you can test it.

Thanks

Herald added a project: Restricted Project. · View Herald TranscriptJan 8 2020, 9:21 AM

It might be possible to rearrange Low Overhead Loops to run before ConstantIslands, but you'd probably need to do more to make it work properly. I don't think ConstantIslands knows how to handle the branches generated by LowOverheadLoop. If you think it's necessary, please split it into a separate patch with proper tests.

Just a quick message, linking in @samparker, and I guess moving Low Overhead Loops to run before ConstantIslands could be problematic, but we can/should have a proper look tomorrow.

yroux added a comment.Jan 8 2020, 10:51 PM

Ah right, let's wait for Sam comments/advises and I'll prepare a proper patch for that.

It might be possible to rearrange Low Overhead Loops to run before ConstantIslands, but you'd probably need to do more to make it work properly. I don't think ConstantIslands knows how to handle the branches generated by LowOverheadLoop.

This, and the fact that LowOverheadLoops is dependent upon block sizes and placement (I need to add a comment in the header on that pass....). So, ConstantIslands would have to guarantee not to change branch targets or change the distance between blocks around the low-overhead loop constructs. I think updating liveness information would be the easier option.

yroux added a comment.Jan 9 2020, 3:25 AM

This, and the fact that LowOverheadLoops is dependent upon block sizes and placement (I need to add a comment in the header on that pass....). So, ConstantIslands would have to guarantee not to change branch targets or change the distance between blocks around the low-overhead loop constructs. I think updating liveness information would be the easier option.

I can add liveness info into the outlined function, but we will need to do the same on AArch64 even if is not necessary. On the other hand isn't it the same for the ConstantIslands pass, I mean isn't the size of blocks modified by the LowOverheadLoops pass which can potentially break the accesses to some constant ? The experiment also shows that the optmization of cmp/beq into cbz made by ConstantIsland place is unleashed when the LowOverheadLoops is made before.

I can add liveness info into the outlined function, but we will need to do the same on AArch64 even if is not necessary.

Hmm, would it be possible to pass a bool to the pass which will control whether liveness is updated? The other, suboptimal, option would be to not do outlining when we have the LOB extension.

On the other hand isn't it the same for the ConstantIslands pass, I mean isn't the size of blocks modified by the LowOverheadLoops pass which can potentially break the accesses to some constant ?

Yes, but LowOverheadLoops does not increase block size or change the CFG. We mainly remove instructions but in the cases where we add instructions, they're replacing pseudos with a size that represents the maximum number of instructions that could be inserted.

yroux added a comment.Jan 9 2020, 6:32 AM

! In D57054#1811740, @samparker wrote:
Hmm, would it be possible to pass a bool to the pass which will control whether liveness is updated? The other, suboptimal, option would be to not do outlining when we have the LOB extension.

Ok maybe, at least in a first keeping the current pass order and disabling outliner when LOB is turned on is the thing to do, and we can move to more optimized version later.

Yes, but LowOverheadLoops does not increase block size or change the CFG. We mainly remove instructions but in the cases where we add instructions, they're replacing pseudos with a size that represents the maximum number of instructions that could be inserted.

Even if you only decrease the size you can have a distance increase due to aligment constraints, for instance if LOB removes some instructions before a load of a constant, then we have .align directive before a constant island, the load is moved upward and the distance between the load and the constant has increased

Even if you only decrease the size you can have a distance increase due to aligment constraints

Ah... thanks for raising this. I'll look into it.

yroux updated this revision to Diff 241188.Jan 29 2020, 9:14 AM

Here is a new version, which disables ARM Low Overhead Loops pass when the Machine Outliner is enabled.

How about removing Thumb-1 support until it's properly handled? I also suggest that this gets broken down a bit too, by handling the different types of function call in different patches, like getting tail-call support in first or something. Also, is this currently set to run all the time and not just when optimising for size?

llvm/lib/CodeGen/MachineOutliner.cpp
1159 ↗(On Diff #241188)

Looks like these should these be set in getRequiredProperties.

yroux marked 2 inline comments as done.Feb 13 2020, 4:47 AM

Thanks for the comments Sam,

Thumb1 is disabled in this version (see inline comment).

The is to bundle Machine Outliner to -Oz like it was done for AArch64, but here it just run when the flags are used (-moutline for clang or -enable-machine-outliner for llc) and not all the time.

Ok I'll try see if I can split it on the outlining kind as you suggest.

llvm/lib/CodeGen/MachineOutliner.cpp
1159 ↗(On Diff #241188)

Maybe @paquette can comment on that

llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
5800–5802 ↗(On Diff #241188)

Thumb1 is disabled here.

yroux marked an inline comment as done.Feb 13 2020, 5:30 AM
yroux added inline comments.
llvm/lib/CodeGen/MachineOutliner.cpp
1159 ↗(On Diff #241188)

Just to clarify, these are properties which are given to the created outlined functions and are required by the next passes (such as Constant Island), thus this is not part of getRequiredProperties

yroux abandoned this revision.Jan 19 2021, 2:02 AM

All patches are now committed.