Outlining isn't always a win just because the saved instruction count is >= 1.
The overhead of representing a new function in the binary depends on
exception metadata and alignment, so make the minimum benefit a threshold that can be tuned locally.
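A minimal sketch of what such a knob could look like in MachineOutliner.cpp, assuming a `cl::opt` spelled after the `-outliner-benefit-threshold` flag mentioned later in this review; the default, description, and exact wiring in the actual diff may differ:

```cpp
#include "llvm/Support/CommandLine.h"

using namespace llvm;

// Sketch only: the minimum estimated benefit an outlining candidate must
// reach before the outliner creates a new function for it. Whether the
// benefit is measured in bytes or instructions is discussed below.
static cl::opt<unsigned> OutlinerBenefitThreshold(
    "outliner-benefit-threshold", cl::init(1), cl::Hidden,
    cl::desc("Minimum estimated benefit required before an outlining "
             "candidate is taken"));
```

The gate itself could then be as simple as skipping any `OutlinedFunction` whose `getBenefit()` falls below this value, and a build could set it with `llc -outliner-benefit-threshold=N` or `clang -mllvm -outliner-benefit-threshold=N`.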
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
I've found internally that the best value here varies heavily. ~35 is best for uncompressed code on aarch64 Linux with .eh_frame, while ~5 seems best for armv7 Linux with .eh_frame. If you're using compression, many outlinings are wasteful compared to what the compressor would have achieved on the un-outlined code, and various project-specific results put the optimal value for this parameter well into the hundreds.
When we compress with our internal tool, SuperPack, just about every outlining pattern becomes a net loss in the compressed representation, so it even makes sense to go into the thousands here.
What are everyone's thoughts on adding such a flag?
Ping @paquette.
This seems like a pretty simple and non-intrusive way to tune the outliner. LGTM if you add a test case.
Overall it looks good to me. Just simplify the test case so that it is more resilient to future changes.
llvm/test/CodeGen/AArch64/machine-outliner-threshold.ll
- Line 6 (On Diff #506239): `; CHECK-NOT: OUTLINED_FUNCTION`
- Line 25 (On Diff #506239): `; CHECK-NOT: OUTLINED_FUNCTION`
- Line 92 (On Diff #506239): I would simplify all of the following without matching the alignment and the body of the outlined function, which can vary: `; CHECK: [[OUTLINED]]:`, `; ALL-DAG: [[OUTLINED]]:`, `; ALL-DAG: [[OUTLINED2]]:`
- Line 117 (On Diff #506239): Is this necessary?

llvm/lib/CodeGen/MachineOutliner.cpp
- Line 121: What is the size in? Instructions? Bytes?
My main concern with this patch is that I'd actually like to avoid adding new heuristics to the outliner in general.
The main benefit of having a late outliner is to leverage accurate information about what will actually be emitted by the compiler.
> The overhead of representing a new function in the binary depends on exception metadata and alignment
Is it impossible to model this in the compiler? Or do we absolutely need a knob for this?
In general, I think it'd be best to improve the outliner cost model accuracy wherever possible, just because that is an option at this level of representation. But if it's absolutely necessary to add a heuristic, I'm not totally against it.
llvm/test/CodeGen/AArch64/machine-outliner-threshold.ll
- Line 1 (On Diff #509888): This test can be removed.

llvm/test/CodeGen/AArch64/machine-outliner-threshold.mir
- Line 2: Shouldn't need all of that in the run line.
- Line 6: This global is not used.
- Line 16: Don't need `noinline`.
> Is it impossible to model this in the compiler? Or do we absolutely need a knob for this? In general, I think it'd be best to improve the outliner cost model accuracy wherever possible, just because that is an option at this level of representation. But if it's absolutely necessary to add a heuristic, I'm not totally against it.
This is what I aimed to do initially. The problem I ran into is that coming up with a model against compression is chaotic in the mathematical sense; for example, the results varied based on the contents of functions in *other libraries*, which we obviously don't have visibility into during outlining. We compress our Android native libraries with SuperPack (https://engineering.fb.com/2021/09/13/core-data/superpack/), which concatenates all our libraries and then compresses them asymmetrically, so a change in libraryA changes the criteria for outlining decisions in libraryB.
The situation is possibly the same for iOS apps as well with IPA compression and any usage of dylibs, though I'm not particularly familiar with this domain.
You could definitely come up with a better heuristic to improve the uncompressed-size results; for example, with -outliner-benefit-threshold=5 I saw more than twice the size savings that the hardcoded default of 1 yielded. This could be modeled more directly by including, say, the cost of unwinding info in the outliner cost modeling.
However, that would still be a very poor result if compressed size is what matters, given that it depends on:
- the compression tool used
- the contents of the other libraries/assets being compressed alongside it
- the value your use case actually places on compressed vs. uncompressed code
- etc.
Across our different aarch64 apps I still saw different values prevailing as the most effective: e.g. 200 for app1, 250 for app2, and 275 for app3, where the only meaningful variation between these apps would be the contents of the other libraries. This is definitively not modelable within the compiler, so even with a "correct" uncompressed model we'd still need an override for compressed needs.
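As a hedged illustration of the "cost of unwinding info" modeling mentioned above (a hypothetical sketch, not something in this patch), the per-function overhead estimate might be adjusted roughly like this; the constants are made-up assumptions, and real numbers depend on the target, the unwind encoding, and the linker:

```cpp
// Hypothetical sketch: fold an estimate of per-function unwind metadata and
// alignment padding into the size the outliner charges for creating a new
// function. ApproxFDEBytes and ApproxAlignPadBytes are illustrative
// assumptions, not measured values.
static unsigned estimateOutlinedFunctionOverheadBytes(unsigned CodeBytes) {
  constexpr unsigned ApproxFDEBytes = 24;     // assumed .eh_frame FDE size
  constexpr unsigned ApproxAlignPadBytes = 8; // assumed alignment padding
  return CodeBytes + ApproxFDEBytes + ApproxAlignPadBytes;
}
```

Even with something along these lines, the compressed-size concerns above would remain, since none of those inputs are visible to the compiler.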
Remember to remove the IR test case before committing :)
llvm/test/CodeGen/AArch64/machine-outliner-threshold.mir
- Line 17: Typo: "different".
Yeah. Even if we come up with a base cost model, we may still want a tuning parameter depending on the target or objectives, which wouldn't be much different from having a threshold like this one.
LGTM.