InlineCost's threshold is multiplied by this value. This lets us adjust
the inlining threshold up or down on a per-target basis. For example,
we might want to increase the threshold on targets where calls are
unusually expensive.
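For concreteness, a minimal sketch of the shape of this mechanism (the type and hook names here are illustrative stand-ins mirroring the description above, not necessarily the exact API in this patch):

```cpp
#include <cassert>

// Sketch only: a target hook returning a factor, consumed by the
// inline cost analysis when computing its threshold.
struct TTISketch {
  // Default: no adjustment. A target where calls are unusually
  // expensive might return a value > 1.
  virtual float getInliningThresholdMultiplier() const { return 1.0f; }
  virtual ~TTISketch() = default;
};

static int adjustThreshold(int BaseThreshold, const TTISketch &TTI) {
  return static_cast<int>(BaseThreshold *
                          TTI.getInliningThresholdMultiplier());
}

int main() {
  TTISketch Default;
  // 225 is the stock -inline-threshold default.
  assert(adjustThreshold(225, Default) == 225);
  return 0;
}
```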
lib/Analysis/TargetTransformInfo.cpp:71–73 (On Diff #51937)
I thought using host FP was to be avoided, so I'm surprised this isn't an integer factor. Couldn't you have one assert on isnormal()?
lib/Analysis/TargetTransformInfo.cpp:71–73 (On Diff #51937)
No idea. It seems substantially less flexible to use an integer factor (you can't decrease the inlining threshold except to 0), but it works fine for my case, so I have no problem changing it if that's the convention.
isnormal(0) seems to be false, but giving a multiplier of 0 doesn't seem outrageous to me?
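For reference, a quick check of that isnormal() behavior (assuming IEEE single precision): a bare isnormal() assert really would reject a 0 multiplier.

```cpp
#include <cassert>
#include <cmath>

int main() {
  assert(std::isnormal(1.5f));    // ordinary nonzero values are "normal"
  assert(!std::isnormal(0.0f));   // zero is classified as not normal
  assert(!std::isnormal(1e-45f)); // subnormals are not "normal" either
  return 0;
}
```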
lib/Analysis/TargetTransformInfo.cpp:71–73 (On Diff #51937)
You could return a rational numerator + denominator pair.
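A sketch of that rational-pair alternative (the hook name is hypothetical): it keeps the computation in integers while still allowing factors below 1.

```cpp
#include <cassert>
#include <utility>

// Hypothetical hook: the target returns numerator/denominator
// instead of a host float.
static std::pair<int, int> getInliningThresholdFactor() {
  return {3, 2}; // scale the threshold by 1.5x
}

static int adjustThreshold(int BaseThreshold) {
  auto [Num, Den] = getInliningThresholdFactor();
  assert(Den > 0 && "denominator must be positive");
  return BaseThreshold * Num / Den;
}

int main() {
  assert(adjustThreshold(225) == 337); // 225 * 3 / 2, rounded down
  return 0;
}
```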
> Wouldn't it be better to use an additive bonus to the threshold, instead of a multiplicative factor, to account for expensive calls?

Not necessarily? It seems not unreasonable that if we went from, e.g., an unmultiplied inlining threshold of 10 to 100, the adjusted threshold should also increase by 10x rather than by +90. It depends on what the intent behind the 10 --> 100 change is.

Given how blunt this is, and given that it's used by exactly one target at the moment, I'm inclined not to bikeshed too hard on the exact form it takes. It would be better to rearrange all of the constants inside InlineCost to better reflect costs on the target machine, but the discussion of exactly what form this should take would probably be much more productive at the point where what we have doesn't work for some new target.
If you'd like a target-specific cost adjustment, which seems perfectly reasonable, it should be additive, and should be a function of the call site (i.e. both the caller and the callee, not just the caller). Also, hopefully the rationale for these numbers can be clearly articulated in comments in the TTI implementation, so we know how to adjust them if we change how TTI.getUserCost works.
> If you'd like a target-specific cost adjustment, which seems perfectly reasonable, it should be additive,
One of the other reasons I think a multiplicative setting makes sense is that there's no way to change the fudge factor itself; all you can change is the base inlining threshold. So the question is: as a user changing the inlining threshold, what do I expect to happen on a platform that has an inlining threshold adjustment?

Suppose I set the (global, unfudged) inlining threshold (again, the only thing I can change) to something very small. It seems to me that should imply that relatively little inlining occurs. But an additive fudge doesn't accomplish that. In fact, there's no good way for me to set the post-fudge inlining threshold to a small value: I'd have to set it to a negative number, and I'd have to calculate that specific value by looking at the TTI.

In contrast, if the fudge is multiplicative, then we have the property that a small pre-fudge inlining threshold results in a small post-fudge threshold.
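To make the arithmetic concrete, a tiny demonstration (the +90 and 10x fudges are purely illustrative, echoing the example above):

```cpp
#include <cstdio>

int main() {
  const int Bonus = 90;      // hypothetical additive fudge
  const float Factor = 10.f; // hypothetical multiplicative fudge
  const int Bases[] = {1, 10, 100};
  for (int Base : Bases) {
    // Additive: even Base = 1 still yields an aggressive threshold of 91.
    // Multiplicative: a small pre-fudge base stays small post-fudge.
    std::printf("base=%3d  additive=%3d  multiplicative=%4d\n", Base,
                Base + Bonus, static_cast<int>(Base * Factor));
  }
  return 0;
}
```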
If a multiplicative fudge is off the table, I think I'd feel more comfortable with the TTI giving an absolute inlining threshold that can be overridden on the command line. Is that any more appealing to you, Hal?
> should be a function of the call site (i.e. both the caller and the callee, not just the caller).
Sure, I can make that change once we agree on the rest here.
> Also, hopefully the rationale for these numbers can be clearly articulated in comments in the TTI implementation, so we know how to adjust them if we change how TTI.getUserCost works.
Haha, just like the current cross-arch threshold is so clearly motivated? :-p I don't have a good rationale other than that this number worked well for many internal benchmarks at Google. It's absolutely not scientifically arrived at; it's just a starting place.
Clearly we want more inlining on nvptx than on other archs, and, anecdotally, I tried a few benchmarks that showed 10% gains from this change. But ptxas does its own inlining, and it's a black box, so this is all very mushy. And more importantly, speed is only half the story -- inlining big functions is a size/speed tradeoff, and that's harder to quantify. Like, GPUs tend to have relatively little code to begin with, so maybe we have more wiggle room on the size end? But how much -- 1.5x, 5x, 100x? Who knows.
I ended up chatting with Hal about this, and he made a really great point. I had been thinking that it is really brittle to have the target-provided inlining threshold be an absolute number instead of a multiplier / ratio as you have it.
However, Hal pointed out that this creates a coupling that could also be problematic. Consider an out-of-tree target with a carefully tuned inlining threshold multiplier. If lots of targets do this, changing the base threshold could become extremely problematic, because small changes would still disturb the target-specific tunings. His suggestion was to use an absolute threshold from the target in the absence of an explicitly specified command line flag.
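A sketch of that fallback scheme (all names here are stand-ins for the real cl::opt and TTI machinery; getDefaultInliningThreshold is a hypothetical hook, and the real code would consult the flag's getNumOccurrences()):

```cpp
// Sketch, not the patch: prefer an explicit -inline-threshold, and
// otherwise fall back to an absolute, target-provided threshold.
struct FlagSketch {
  int Value = 225;        // stock -inline-threshold default
  int NumOccurrences = 0; // > 0 iff the user passed the flag explicitly
};

struct TTISketch {
  // Hypothetical hook: an absolute, target-chosen threshold.
  int getDefaultInliningThreshold() const { return 1125; }
};

int computeBaseThreshold(const FlagSketch &InlineThreshold,
                         const TTISketch &TTI) {
  // An explicit command-line flag always wins; otherwise take the
  // target's absolute default.
  if (InlineThreshold.NumOccurrences > 0)
    return InlineThreshold.Value;
  return TTI.getDefaultInliningThreshold();
}

int main() {
  FlagSketch Flag; // user did not pass -inline-threshold
  TTISketch TTI;
  return computeBaseThreshold(Flag, TTI) == 1125 ? 0 : 1;
}
```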
Thinking about this more, I think it still presents the same problem. Consider a change to the inliner that significantly changes the rate at which we inline things. It might be useful to be able to adjust the threshold when making the change to keep most inlining decisions neutral across targets.
I'm not completely sure that either of these approaches is really resistant to undue coupling...
One possible alternative is to not have the size-based inlining be target-configurable, and to make this exclusively handled by the proposed runtime cost estimation based inlining when that is available. Perhaps this could be documented as a temporary hook that will be replaced with the runtime estimation? I'm curious what others think here.
>> Also, hopefully the rationale for these numbers can be clearly articulated in comments in the TTI implementation, so we know how to adjust them if we change how TTI.getUserCost works.

> Haha, just like the current cross-arch threshold is so clearly motivated?
I'm going to write up a document describing the current system since this keeps coming up and causing confusion. I'll also try to sketch how other models would fit into this.
> I don't have a good rationale other than that this number worked well for many internal benchmarks at Google. It's absolutely not scientifically arrived at; it's just a starting place.

> Clearly we want more inlining on nvptx than on other archs, and, anecdotally, I tried a few benchmarks that showed 10% gains from this change. But ptxas does its own inlining, and it's a black box, so this is all very mushy. And more importantly, speed is only half the story -- inlining big functions is a size/speed tradeoff, and that's harder to quantify. Like, GPUs tend to have relatively little code to begin with, so maybe we have more wiggle room on the size end? But how much -- 1.5x, 5x, 100x? Who knows.
GPUs tend to not have significant icache-style bottlenecks because their kernels are typically small and easily predicted, etc. Still, there remains a fundamental limit on instruction working set size where inlining is profitable.
I think you'll want to do some "science" on this value eventually and document it. Typically we walk the value up and down and try to get an idea of the shape of the aggregate performance curve across a wide range of benchmarks. Then we look for what is usually a large flat mesa in the curve, and go with the small end of that.
-Chandler
> Consider an out-of-tree target with a carefully tuned inlining threshold multiplier. If lots of targets do this, changing the threshold could become extremely problematic because small changes would still disturb the target-specific tunings.
My thought was, the inlining multiplier is such a coarse knob -- set atop rather coarse heuristics -- that tuning it particularly carefully for a target might not make much sense.
I don't know if that's true or not, but it's at least an argument one could make.
> His suggestion was to use an absolute threshold from the target in the absence of an explicitly specified command line flag. Thinking about this more, I think it still presents the same problem. Consider a change to the inliner that significantly changes the rate at which we inline things. It might be useful to be able to adjust the threshold when making the change to keep most inlining decisions neutral across targets.
I guess this (and the previous one, to an extent) is a question of API stability guarantees. To the extent that we don't promise a stable API for out-of-tree targets, we could change the inliner and at the same time break compilation for any out-of-tree targets that set the threshold, right? They'd have to fix the error and recompute their inlining threshold.
That doesn't seem so unreasonable to me -- it's not like we'd be doing this once a month, or even every release. And as an out-of-tree target, you'd only be broken if you explicitly opted in to this tuning mechanism, which we could make clear comes with this possibility of future intentional breakage.
> I think you'll want to do some "science" on this value eventually and document it.
Yes, absolutely. That change doesn't even appear in this patch, though; let's worry about it separately, if that's OK?
This is a concern, and you're right that any of these tuning knobs would need to be retuned in response to upstream changes. Fundamentally, I'm most concerned with sending the right message here, and not unnecessarily constraining our future actions (because, while we're certainly free to change whatever we want in the future regardless of backend tuning, in practice we try to be nice and accommodating, because these things take time).
The message we want to send is:
- This is a temporary hook, and while some mechanism to achieve this effect will be provided in the future, it might look nothing like this.
- This is a coarse heuristic mechanism, not really backed by any self-consistent model.
After chatting with Chandler about this, my understanding of the current mechanism is this: our current heuristic is size-based (with some exceptions), and aims to answer the following question: will inlining this function make the caller smaller?
Doing this involves two pieces. The first is constructing an estimate of the size of the inlined function (which is truncated at some point to keep the cost of the analysis from becoming quadratic), accounting for constant propagation via function arguments and the resulting simplifications. Then we construct an estimate of the caller's size associated with the call site that can be eliminated by inlining (the call itself, allocas that can be eliminated, etc.).
The second piece is more empirical: how many instructions, on average, will be eliminated in the caller by the simplification that results from inlining a function? This is where the current threshold comes in: it is a (rough) estimate of this. Unfortunately, as currently used, it is not purely an estimate of this, but also accounts for performance effects (which is why, for instance, we have a bonus for callees with vector instructions, etc.).
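In rough code, the structure described above might look like this (a sketch of the shape only, not the actual InlineCost implementation; the two estimates stand in for the real analyses):

```cpp
// Sketch, not the real InlineCost code.
struct CallSiteSketch {
  int EstimatedCalleeSize;   // callee size after constant propagation
                             // and the simplifications it enables
  int EstimatedCallOverhead; // caller size removed by inlining: the
                             // call itself, allocas that fold away, etc.
};

bool shouldInline(const CallSiteSketch &CS, int Threshold) {
  // Threshold is the empirical piece: a rough estimate of how much the
  // caller will simplify as a result of inlining (and, today, also a
  // carrier for performance bonuses such as the vector-instruction one).
  return CS.EstimatedCalleeSize - CS.EstimatedCallOverhead < Threshold;
}
```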
Longer term, we really need to separate the performance-based metric from the size-based metric. They're estimating different things, although we're unfortunately conflating them currently (by hiding our performance-based metric inside our estimate of size decrease from resulting caller simplification, and to a lesser extent, things like the call penalty).
For your use case, GPUs, you want to represent the fact that calls are really expensive, and you don't care nearly as much about code size, although you do care about compile time. As a result, your threshold is there more to limit the compile time spent in later stages of the compiler (and perhaps in the inliner itself) than anything else. That having been said, NVIDIA GPUs do have an icache that holds some kilobytes of compiled code, and spilling out of that will cost you (although often not a huge amount).
In the long term, I imagine we want a more top-down heuristic for this kind of inlining, because we want to keep a function, or group of functions, below some absolute size threshold (unless there is a huge compensating performance speedup). The size-based heuristic is also good for your use case, but only if it happens in a purer sense.
So my conclusion is that all of this is wrong ;) -- but in the name of making progress, I'm fine with a way to bump the threshold for some targets. I think it should be just a number, because that's the simplest possible thing we can do. A plain number is difficult, however, because it cannot directly account for -Os and the other factors that go into the current threshold. Providing a fraction to be multiplied by the current threshold seems problematic because it sends the message that this is something that should be finely tuned, and it shouldn't be. Could this just be an integer multiplier for now? That will prevent fine tuning, send the message that this is a temporary coarse knob, and then we can work on a more principled solution. If that would work for you, I think that is the best solution for now.
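Concretely, the shape of the integer-multiplier variant (a sketch; the GPU value of 5 is purely illustrative and would need the benchmarking discussed above):

```cpp
#include <cassert>

// Deliberately coarse by construction: the smallest step above 1x is 2x.
struct TTISketch {
  virtual unsigned getInliningThresholdMultiplier() const { return 1; }
  virtual ~TTISketch() = default;
};

struct GPUTTISketch : TTISketch {
  // Illustrative value only; the real number needs benchmarking.
  unsigned getInliningThresholdMultiplier() const override { return 5; }
};

int main() {
  GPUTTISketch TTI;
  // Scales the stock 225 threshold, but only in whole multiples.
  assert(225 * TTI.getInliningThresholdMultiplier() == 1125);
  return 0;
}
```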
> Could this just be an integer multiplier for now? That will prevent fine tuning, send the message that this is a temporary coarse knob, and then we can work on a more principled solution.
That sounds great to me, I will spin a patch. If anyone else disagrees, speak now or forever hold your peace...
Hal, I hope I didn't misunderstand: I took your comment to suggest that the multiplier should be fixed for a particular target -- it shouldn't depend on the caller/callee. I can add that back if that's not what you had in mind.
This is fine. We can add those things back if we come up with a use case. As discussed, we really need to rework a bunch of this logic. LGTM.