This is an archive of the discontinued LLVM Phabricator instance.

[Outliner] Add an option to only enable outlining of patterns above a certain threshold
ClosedPublic

Authored by lanza on Oct 26 2022, 10:20 AM.

Details

Summary

Outlining isn't always a win when the saved instruction count is >= 1.
The overhead of representing a new function in the binary depends on
exception metadata and alignment. So parameterize this for local tuning.

Diff Detail

Event Timeline

lanza created this revision.Oct 26 2022, 10:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 26 2022, 10:20 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
lanza requested review of this revision.Oct 26 2022, 10:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 26 2022, 10:20 AM
lanza added reviewers: kyulee, smeenai, zer0.EditedOct 26 2022, 10:24 AM

I've found internally that the value here has heavy variance. ~35 is best for uncompressed code on aarch64 Linux with .eh_frame. ~5 seems best for armv7 Linux with .eh_frame. If you're using compression then many outlinings are wasteful compared to compressibility and various project-specific results come out with this parameter finding optimal values up through many hundreds.

When we use our local tool called Superpack it makes just about all outlining patterns a net loss on the compressed representation, so it even makes sense to go into the thousands here.

What's everybody's thoughts on adding such a flag?

This makes sense to me, but we should probably add a test case?

This makes sense to me, but we should probably add a test case?

Yup, will do. Figure I'd get consensus that this is acceptable by upstream first.

Ping @paquette.

This seems like a pretty simple and non-intrusive method to be able to tune the outliner. LGTM if you add a test case.

lanza updated this revision to Diff 506239.Mar 17 2023, 5:13 PM

Add test

Overall it looks good to me. Just simplify the test case that is more resilient to the future changes.

llvm/test/CodeGen/AArch64/machine-outliner-threshold.ll
6 ↗(On Diff #506239)
; CHECK-NOT: OUTLINED_FUNCTION
25 ↗(On Diff #506239)
; CHECK-NOT: OUTLINED_FUNCTION
92 ↗(On Diff #506239)

I would simplify all the following without matching align and the body of outlined function which can vary.

; CHECK: [[OUTLINED]]:
; ALL-DAG: [[OUTLINED]]:
; ALL-DAG: [[OUTLINED2]]:
117 ↗(On Diff #506239)

Is this necessary?

The overhead of representing a new function in the binary depends on exception metadata and alignment

Is it possible to know this without introducing a heuristic?

llvm/lib/CodeGen/MachineOutliner.cpp
121
llvm/test/CodeGen/AArch64/machine-outliner-threshold.ll
1 ↗(On Diff #506239)

MIR testcase?

plotfi edited reviewers, added: plotfi; removed: zer0.Mar 17 2023, 6:26 PM
plotfi added inline comments.
llvm/lib/CodeGen/MachineOutliner.cpp
121

What is the size in? Instructions? Bytes?

lanza updated this revision to Diff 509884.Mar 30 2023, 7:55 PM
lanza marked 6 inline comments as done.

Fix a few things

lanza updated this revision to Diff 509888.Mar 30 2023, 8:13 PM
lanza marked an inline comment as done.

Update for std::optional usage change

My main concern with this patch is that I'd actually like to avoid adding new heuristics to the outliner in general.

The main benefit of having a late outliner is to leverage accurate information about what will actually be emitted by the compiler.

The overhead of representing a new function in the binary depends on exception metadata and alignment

Is it impossible to model this in the compiler? Or do we absolutely need a knob for this?

In general, I think it'd be best to improve the outliner cost model accuracy wherever possible, just because that is an option at this level of representation. But if it's absolutely necessary to add a heuristic, I'm not totally against it,

llvm/test/CodeGen/AArch64/machine-outliner-threshold.ll
1 ↗(On Diff #509888)

this test can be removed

llvm/test/CodeGen/AArch64/machine-outliner-threshold.mir
2

shouldn't need all of that in the runline

3
6

this global is not used

16

don't need noinline

This comment was removed by lanza.

Is it impossible to model this in the compiler? Or do we absolutely need a knob for this? In general, I think it'd be best to improve the outliner cost model accuracy wherever possible, just because that is an option at this level of representation. But if it's absolutely necessary to add a heuristic, I'm not totally against it,

This is what I aimed to do initially. The problem I ran into is that coming up with a model against compression is chaotic in the mathematical sense. e.g. the results varied based on the contents of functions in *other libraries* which we obviously don't have visibility into during outlining. We compress our Android native libraries with SuperPack (https://engineering.fb.com/2021/09/13/core-data/superpack/) which concatenates all our libraries and then compresses them asymmetrically. So a change in libraryA changes the criteria for outlining decisions in libraryB.

The situation is possibly the same for iOS apps as well with IPA compression and any usage of dylibs, though I'm not particularly familiar with this domain.

You could definitely come up with a better heuristic to improve the uncompressed size results. e.g. with -outliner-benefit-threshold=5 I saw over twice the size savings than the default hardcoded 1 yielded. This could be modeled more directly by including, for example, the cost of unwinding info in the outliner cost modeling.

However, that would still be a very poor result if compressed size was important given that it depends

  • the compression tool used
  • the contents of the other libraries/assets being compressed along side
  • the actual valuation your use case has for compressed vs uncompressed code
  • etc

Across our different aarch64 apps I still saw different values prevailing as most effective. e.g. 200 for app1, 250 for app2 and 275 for app3 where the only meaningful variation between these apps would be the contents of other libraries. This is definitively not modelable within the compiler. So even with a "correct" uncompressed model we'd still need an override for compressed needs.

lanza updated this revision to Diff 510987.Apr 4 2023, 8:08 PM
lanza marked an inline comment as done.

clean up test a bit

paquette accepted this revision.Apr 5 2023, 10:16 AM

Alright, considering this cannot be modeled inside the compiler, LGTM

This revision is now accepted and ready to land.Apr 5 2023, 10:16 AM

Remember to remove the IR test case before committing :)

llvm/test/CodeGen/AArch64/machine-outliner-threshold.mir
17

Typo: different

Yeah. Even though we may come up with a base cost model, we may still have a tuning parameter anyhow depending on target or objectives, which won't be much different than having a threshold like this case.
LGTM.

This revision was automatically updated to reflect the committed changes.
lanza marked 2 inline comments as done.