Looks fine. Cosmetic questions from me. Thanks!
lib/Target/AMDGPU/AMDGPUInline.cpp | ||
---|---|---|
118 | Accumulate all pointer args to private arrays; if the total size exceeds ArgAllocaCutoff, then bail out. Looks fine. I was just initially confused by the comment. | |
187 | CS.getCaller() doesn't need to be called again? The caller was set previously. | |
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp | ||
329 | This means if EnableAMDGPUFunctionCalls is set, we will not have early inlining anymore? | |
test/CodeGen/AMDGPU/amdgpu-inline.ll | ||
44 | Does the existence of "tail call @foo" count as @foo being inlined? @foo should be inlined, right? |
Removed extra CS.getCaller() call.
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp | ||
---|---|---|
329 | We will not have early inline *all*. We always have selective early inlining because we run opt with the FE before linking, but w/o this new inlining pass it was the pristine LLVM simple inliner, which did not inline a lot of functions where we would like to see them inlined. Now the default inliner is replaced with our more aggressive version. It still runs twice, with both invocations of opt, but it does not need a special pass to mark every function always_inline (which is, in a nutshell, what -amdgpu-early-inline-all does). Moreover, w/o this change, even if we enable calls we will get no calls, as all of the functions will be marked always_inline and then inlined. Eventually the whole AMDGPUAlwaysInlinePass shall be removed when we enable calls. | |
test/CodeGen/AMDGPU/amdgpu-inline.ll | ||
44 | It is the GCN-INL1 check, which corresponds to the run with -inline-threshold=1. I.e., for the test's purposes I'm practically asking not to inline conventional functions unless they have a really good reason to be inlined (like use of scratch). The GCN-INLDEF check below represents the default threshold, under which foo is supposed to be inlined. |
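A tiny sketch of the decision the two check prefixes exercise: under -inline-threshold=1 an ordinary call is rejected, while a call with a strong target-specific reason (modeled here as a bonus for passing a private scratch pointer) still goes through. The costs and the bonus value are made up for illustration; this is not the pass's actual cost model.

```cpp
#include <cassert>

// Illustrative only: BaseCost and the 1000-point scratch bonus are assumed
// numbers, not values taken from AMDGPUInline.cpp.
bool shouldInline(int BaseCost, bool PassesPrivatePtr, int Threshold) {
  // A call passing a pointer to private (scratch) memory gets a large bonus,
  // so it is inlined even under -inline-threshold=1.
  int Cost = BaseCost - (PassesPrivatePtr ? 1000 : 0);
  return Cost < Threshold;
}
```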
I don't like the idea of having a custom inliner, and thinking about this is very premature. We haven't attempted any benchmarking of calls or looked at the current inlining behavior or whether it could be improved. I would rather not commit this until it is clear it is absolutely necessary.
We did a lot of benchmarking with HSAIL. LLVM's general inliner has not changed much since then and does not factor in things that are important for us.
It's not really meaningful to compare inlining of HSAIL to AMDGPU. HSAIL never implemented any of the TTI cost model (and in fact completely broke it), plus there are other ABI differences. HSAIL always used the default ABI, with sret / byval, among other differences.
lib/Target/AMDGPU/AMDGPUInline.cpp | ||
---|---|---|
31–33 | We should avoid having a custom version of the exact same flag that the regular inliner uses. This would result in some surprises. | |
54 | Hardcoded threshold. We should rely on the default opt-level based thresholds and override getInliningThresholdMultiplier to get the larger target default. That way the standard flags will work as expected. | |
113 | Early return if no callee rather than wrapping the rest of the function in if | |
121 | Skip potentially costly GetUnderlyingObject call if the address space isn't private | |
124–125 | I'm pretty sure you aren't allowed to alloca an unsized type, so this check isn't needed. However you do need to check / skip dynamically sized allocas. | |
167–175 | Call the base implementation first and see if it says always or never, rather than reproducing the logic? | |
177–178 | Wrapper calls will (ignoring noinline) always be inlined by the stock inline heuristics, so this shouldn't be necessary | |
181–183 | This heuristic doesn't make any sense to me. Why does the block count matter? Just use the default cost. | |
test/CodeGen/AMDGPU/amdgpu-inline.ll | ||
76 | Needs tests with multiple private pointers and with the alloca size threshold exceeded |
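The structural suggestions above (early return instead of wrapping the function body in an if, checking the address space before the potentially costly GetUnderlyingObject walk, and skipping dynamically sized allocas) can be sketched roughly as follows. `Alloca`, `AMDGPUAS_PRIVATE`, and the size-0 encoding of a dynamic alloca are hypothetical stand-ins for the real LLVM types, not the pass's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-ins for the LLVM entities discussed in the review.
constexpr unsigned AMDGPUAS_PRIVATE = 5;

struct Alloca {
  uint64_t Size;      // static size in bytes; 0 models a dynamically sized alloca
  unsigned AddrSpace;
};

// Sketch of the requested control flow: early-return when there is nothing to
// analyze, skip pointers outside the private address space before any costly
// analysis, and skip dynamically sized allocas.
std::optional<uint64_t> allocaSizeIfCounted(const Alloca *A) {
  if (!A)                               // early return instead of wrapping in if
    return std::nullopt;
  if (A->AddrSpace != AMDGPUAS_PRIVATE) // avoid the costly underlying-object walk
    return std::nullopt;
  if (A->Size == 0)                     // dynamically sized alloca: skip
    return std::nullopt;
  return A->Size;
}
```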
Addressed review comments.
Rebased to master.
Added detection of the case when a subscript of the same alloca is passed multiple times: count it as just a single alloca (found while writing a new test).
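The accumulation described above can be sketched like this: sum the sizes of the private allocas reachable from pointer arguments, count each underlying alloca only once even when several subscripts of it are passed, and bail out once the total exceeds the cutoff. The `ArgAllocaCutoff` value and the `UnderlyingAllocaId` abstraction (standing in for the result of GetUnderlyingObject) are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Assumed cutoff value, not the one used by the actual pass.
constexpr uint64_t ArgAllocaCutoff = 256;

// Each pointer argument is reduced to the identity and size of its
// underlying alloca (hypothetical abstraction over GetUnderlyingObject).
struct PtrArg { int UnderlyingAllocaId; uint64_t AllocaSize; };

bool totalAllocaSizeWithinCutoff(const std::vector<PtrArg> &Args) {
  std::set<int> Seen;
  uint64_t Total = 0;
  for (const PtrArg &A : Args) {
    if (!Seen.insert(A.UnderlyingAllocaId).second)
      continue;                 // same alloca passed again: count it once
    Total += A.AllocaSize;
    if (Total > ArgAllocaCutoff)
      return false;             // exceeds cutoff: do not boost inlining
  }
  return true;
}
```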
lib/Target/AMDGPU/AMDGPUInline.cpp | ||
---|---|---|
31–33 | The stock inline hint threshold is only 44% higher than the regular threshold. Here I have it 6 times higher. The exact number may and will change, but if I return 9 from getInliningThresholdMultiplier() and multiply the hint by that very same number, it will still be a very low number, far less than we would really like to have. | |
124–125 | That was for opaque types, which we do not use now, and we had allocas of them. | |
167–175 | The base implementation is pure virtual. I could call llvm::getInlineCost() directly instead (it is called from SimpleInliner::getInlineCost() anyway) twice, but I want to avoid the expensive CallAnalyzer. | |
177–178 | That is unless we hit MaxBB, in which case we still want to inline it. | |
181–183 | That is to prevent huge compilation time of some programs. Not an ideal heuristic, but better than nothing. |
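The arithmetic behind the threshold discussion above, using what I believe were LLVM's stock values at the time (default 225, hint 325, so the hint is only ~44% above the default) and the multiplier of 9 mentioned in this thread; treat the exact constants as assumptions:

```cpp
#include <cassert>

// Assumed stock LLVM thresholds at the time of this review.
constexpr int DefaultThreshold = 225;
constexpr int HintThreshold    = 325;  // only ~44% above the default
constexpr int Multiplier       = 9;    // getInliningThresholdMultiplier()

// The multiplier scales both thresholds uniformly, so the hint threshold
// stays only ~44% above the scaled default rather than the ~6x boost the
// patch wants for hinted functions.
int scaledDefault() { return DefaultThreshold * Multiplier; }
int scaledHint()    { return HintThreshold * Multiplier; }
```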
lib/Target/AMDGPU/AMDGPUInline.cpp | ||
---|---|---|
31–33 | If that is really a problem we should find a different solution. I really don't want to have the set of flags for debugging this changing. |
lib/Target/AMDGPU/AMDGPUInline.cpp | ||
---|---|---|
31–33 | OK, when we hit it next time I guess we could expose something from TTI. |
lib/Target/AMDGPU/AMDGPUInline.cpp | ||
---|---|---|
177–178 | By doing this you could possibly be allowing inlining with incompatible function attributes. llvm::getInlineCost has the additional functionsHaveCompatibleAttributes check, which does more than the areInlineCompatible check above | |
181–183 | Compilation time spent on what? That it requires this custom wrapper-function check sounds like additional motivation to drop it. | |
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h | ||
165–166 | How did you decide 9 for this? |
lib/Target/AMDGPU/AMDGPUInline.cpp | ||
---|---|---|
181–183 | The actual testcase which required that was VRay, AFAIR. When we increase the inline threshold (or inline everything, like now) we are vulnerable to extremely high compilation times on huge codebases. In the case of VRay it was hours, and this decreased it to minutes. | |
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h | ||
165–166 | This is the rounded ratio of the standard inline threshold to the one tuned for HSAIL. We will need to retune it later.
lib/Target/AMDGPU/AMDGPUInline.cpp | ||
---|---|---|
177–178 | areInlineCompatible() is called earlier, at line 167. If the attributes are not compatible we will not get to this point.
There should be an explanation of what this pass does and why it is better than LLVM's default inliner, along with some benchmark data showing which applications / games this helps.
Explanation added. For the benchmarks, the one I used was Luxmark, where it gave ~12% better results compared to the standard inliner. For the others, it does not seem worthwhile to try to recollect the old HSAIL numbers.
This pass is ported because we know in advance there is an issue like this. As said before, the cost model is different, and some tuning will also be required later when we turn call support on.
lib/Target/AMDGPU/AMDGPUInline.cpp | ||
---|---|---|
181–183 | I meant: where was the compile time spent? I doubt it was the inliner itself. | |
lib/Target/AMDGPU/AMDGPUInline.cpp | ||
---|---|---|
181–183 | As usual: optimizing the inflated code, scheduling, and RA. | |
lib/Target/AMDGPU/AMDGPUInline.cpp | ||
---|---|---|
181–183 | I'd rather drop this workaround for now at least. |
LGTM. It does look like the default inline heuristic does consider the SROA-ability of the passed arguments to some degree.