This is an archive of the discontinued LLVM Phabricator instance.

[Inliner] Attempt to more accurately model the cost of loops at minsize
Changes PlannedPublic

Authored by dmgreen on Oct 18 2018, 10:34 AM.

Details

Summary

This is three separate things, which when applied together come to much the same effect as D52716. They are:

  • Reduce the call penalty at minsize to 15. In my testing on Arm and X86, this seemed to be the best spot for reducing codesize. It is still not 0 because of inaccuracies in the inliner cost modelling, but over time can be gradually decreased as things gets better. (I did not change the Threshold as it seemed sensible to also decrease the cost of sub-calls at minsize).
  • If a block has more than one unconditional predecessor, mark each one past the first as non-free. This models the extra branch costs for loops (but can cause problems for cases where the blocks are not in the form they will appear in assembly).
  • Geps that are used by phis are not free. This is attempting to capture the geps in loop IVs. (I'm not sure if there is a better way to capture that.)

All together they make us much less likely to inline small loops at minsize. The patch is D52716 is still better for total codesize in my testing, but this attempts to model things more precisely.

Diff Detail

Event Timeline

dmgreen created this revision.Oct 18 2018, 10:34 AM

(but can cause problems for cases where the blocks are not in the form they will appear in assembly).

I'm not sure what sort of issue you're running into here?


Reducing the call penalty seems like it's a good idea if our inline modeling has improved since it was set. But I'm a little cautious about messing with it on its own. If you decreased both the call penalty and the overall threshold at the same time, it would have the effect of encouraging inlining for functions which call other functions; if that has a good effect on codesize, we should do that. If that's not the effect we want, we should probably just change the overall threshold instead.


The PHI handling change is kind of ugly, but I guess it's roughly right. Not sure we want to guard it with minsize, specifically; it should do the right thing at higher inlining levels.


If we're going to start spending time making changes like this that slightly tweak the inline cost for the sake of minsize, I think we need to be a bit more methodical in terms of testing, so we don't end up playing tug-of-war with inliner tweaks that affect different codebases in different ways. For example, having a series of testcases with calls to small functions near the various thresholds, accompanied by actual measurements of the codesize on a couple targets, so we can be confident that tweaks are actually profitable. For other optimization levels, inlining is more about capturing big optimization opportunities, as opposed to the exact cost of various instructions, so it doesn't matter as much, but I think spending more time to construct tests will pay off for -Oz in particular. We currently have very little test coverage for minsize inlining.

dmgreen planned changes to this revision.Nov 5 2018, 8:17 AM

(but can cause problems for cases where the blocks are not in the form they will appear in assembly).

I'm not sure what sort of issue you're running into here?

IIRC, One example I looked at was code like:

bb1:
  ..
  br end
bb2:
  ..
  br end
end:
  ret

The final assembly will fold the ret's into the previous blocks, so no extra branches (it may have returned a value, with a phi being it the end block). Generally we have to work with what we expect the final form to be, not what the form is right now.

Reducing the call penalty seems like it's a good idea if our inline modeling has improved since it was set. But I'm a little cautious about messing with it on its own. If you decreased both the call penalty and the overall threshold at the same time, it would have the effect of encouraging inlining for functions which call other functions; if that has a good effect on codesize, we should do that. If that's not the effect we want, we should probably just change the overall threshold instead.

I wasn't sure why a subcall should cost more than a single instruction at minsize (baring obvious registry setup). There may be some knock-on effects I'm not thinking of, but if it's only going to be a single instruction, it only needs to be counted as such. Extra spilled registers around calls perhaps?

The PHI handling change is kind of ugly, but I guess it's roughly right. Not sure we want to guard it with minsize, specifically; it should do the right thing at higher inlining levels.

Yeah, I was a little wary of changing anything when optimisation for performance (having to run a lot of benchmarking and the potential reverts). Are register moves "free" there or not? Depends on the target I would guess, but we do count register setup for calls. There was also a test in AArch64/phi.ll (outer13 I think, but it may have been one of the ones before/after), that was failing. Looking again at the test though, it may just be that the test isn't really right, if instcombine can simplify it already.

If we're going to start spending time making changes like this that slightly tweak the inline cost for the sake of minsize, I think we need to be a bit more methodical in terms of testing, so we don't end up playing tug-of-war with inliner tweaks that affect different codebases in different ways. For example, having a series of testcases with calls to small functions near the various thresholds, accompanied by actual measurements of the codesize on a couple targets, so we can be confident that tweaks are actually profitable. For other optimization levels, inlining is more about capturing big optimization opportunities, as opposed to the exact cost of various instructions, so it doesn't matter as much, but I think spending more time to construct tests will pay off for -Oz in particular. We currently have very little test coverage for minsize inlining.

That sounds smart. I have been going with compiling lots of codebases and see what's looks best, adding interesting cases as testcases over time. More smaller testcases would be better, but we can easily hit cases where we are just not modelling the correct thing. It may be difficult to get all the test cases "correct", some that should be inlined may not be, some that are inlined should not. It's sometimes hard to change a test, even if the end result is better overall.

With D52716 in, my main motivation for the code here will disappear. I will hopefully try to rerun the numbers on top of that patch, seeing how well we do now and how much of this is still an improvement.

zzheng resigned from this revision.Feb 25 2021, 9:53 AM