This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Port of HSAIL inliner
ClosedPublic

Authored by rampitec on Aug 17 2017, 2:01 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec created this revision.Aug 17 2017, 2:01 PM
rampitec updated this revision to Diff 112191.Aug 22 2017, 10:06 AM

Added OptimizationRemarkEmitter as now in master.

Looks fine. Cosmetic questions from me. Thanks!

lib/Target/AMDGPU/AMDGPUInline.cpp
117 ↗(On Diff #112191)

Accumulate the sizes of all pointer args to private arrays; if the total size exceeds ArgAllocaCutoff, then bail out. Looks fine. I was just initially confused by the comment.

186 ↗(On Diff #112191)

CS.getCaller() doesn't need to be called again, does it? Caller was set previously.

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
329 ↗(On Diff #112191)

Does this mean that if EnableAMDGPUFunctionCalls is set, we will not have early inlining anymore?

test/CodeGen/AMDGPU/amdgpu-inline.ll
43 ↗(On Diff #112191)

Does the presence of "tail call @foo" mean @foo is considered inlined? @foo should be inlined, right?

rampitec updated this revision to Diff 112498.Aug 23 2017, 9:07 PM
rampitec marked 3 inline comments as done.

Removed extra CS.getCaller() call.

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
329 ↗(On Diff #112191)

We will not have early inline *all*. We always have selective early inlining because we run opt with the FE before linking, but without this new inlining pass it was the pristine LLVM simple inliner, which did not inline a lot of functions where we would like to see them inlined. Now the default inliner is replaced with our more aggressive version. It still runs twice with both invocations of opt, but it does not need a special pass to mark every function always_inline (which is, in a nutshell, what -amdgpu-early-inline-all does).

Moreover, without this change, even if we enable calls we will get no calls, as all of the functions will be marked always_inline and then inlined. Eventually the whole AMDGPUAlwaysInlinePass shall be removed when we enable calls.

test/CodeGen/AMDGPU/amdgpu-inline.ll
43 ↗(On Diff #112191)

It is the GCN-INL1 check, which corresponds to the run with -inline-threshold=1. I.e., for test purposes I'm practically asking not to inline conventional functions unless they have a really good reason to be inlined (like use of scratch).

The GCN-INLDEF check below represents the default threshold, under which @foo is supposed to be inlined.

alfred.j.huang accepted this revision.Aug 23 2017, 9:55 PM
This revision is now accepted and ready to land.Aug 23 2017, 9:55 PM
arsenm requested changes to this revision.Aug 23 2017, 11:02 PM

I don't like the idea of having a custom inliner, and thinking about this is very premature. We haven't attempted any benchmarking of calls or looked at the current inlining behavior or whether it could be improved. I would rather not commit this until it is clear it is absolutely necessary.

This revision now requires changes to proceed.Aug 23 2017, 11:02 PM

I don't like the idea of having a custom inliner, and thinking about this is very premature. We haven't attempted any benchmarking of calls or looked at the current inlining behavior or whether it could be improved. I would rather not commit this until it is clear it is absolutely necessary.

We did a lot of benchmarking with HSAIL. LLVM's general inliner has not changed much since then and does not factor in things that are important for us.

It's not really meaningful to compare inlining of HSAIL to AMDGPU. HSAIL never implemented any of the TTI cost model (and actually completely broke it), plus there are other ABI differences. HSAIL always used the default ABI with sret / byval, among other differences.

What in the default inliner will bump the threshold if a private pointer is passed?

HSAIL never implemented any of the TTI cost model

And this is also a completely false statement.

arsenm added inline comments.Aug 28 2017, 10:24 AM
lib/Target/AMDGPU/AMDGPUInline.cpp
30–32 ↗(On Diff #112498)

We should avoid having a custom version of the exact same flag that the regular inliner uses. This would result in some surprises.

53 ↗(On Diff #112498)

Hardcoded threshold. We should rely on the default opt-level based thresholds and override getInliningThresholdMultiplier to get the larger target default. That way the standard flags will work as expected.

112 ↗(On Diff #112498)

Early return if there is no callee rather than wrapping the rest of the function in an if.

120 ↗(On Diff #112498)

Skip the potentially costly GetUnderlyingObject call if the address space isn't private.

123–124 ↗(On Diff #112498)

I'm pretty sure you aren't allowed to alloca an unsized type, so this check isn't needed.

However you do need to check / skip dynamically sized allocas.

166–174 ↗(On Diff #112498)

Call the base implementation and see if it says always or never first rather than reproducing the logic?

176–177 ↗(On Diff #112498)

Wrapper calls will (ignoring noinline) always be inlined by the stock inline heuristics, so this shouldn't be necessary.

180–182 ↗(On Diff #112498)

This heuristic doesn't make any sense to me. Why does the block count matter? Just use the default cost.

test/CodeGen/AMDGPU/amdgpu-inline.ll
75 ↗(On Diff #112498)

Needs tests with multiple private pointers and with the alloca size threshold exceeded.

rampitec updated this revision to Diff 112989.Aug 28 2017, 4:36 PM
rampitec edited edge metadata.
rampitec marked 5 inline comments as done.

Addressed review comments.
Rebased to master.
Added detection of the case when a subscript of the same alloca is passed multiple times; it is now counted as just a single alloca (found while writing the new test).

lib/Target/AMDGPU/AMDGPUInline.cpp
30–32 ↗(On Diff #112498)

The stock inline hint threshold is only 44% higher than the regular threshold. Here I have it 6 times higher. The exact number may and will change, but if I return 9 from getInliningThresholdMultiplier() and multiply the hint by that very same number, it will still be a very low number, far less than we would really like to have.

123–124 ↗(On Diff #112498)

That was for opaque types, which we do not use now, and we had allocas of them.

166–174 ↗(On Diff #112498)

The base implementation is pure virtual. I could instead call llvm::getInlineCost() directly two times (it is called from SimpleInliner::getInlineCost() anyway), but I want to avoid the expensive CallAnalyzer.

176–177 ↗(On Diff #112498)

That holds unless we hit MaxBB, in which case we still want to inline it.

180–182 ↗(On Diff #112498)

That is to prevent huge compilation times for some programs. Not an ideal heuristic, but better than nothing.

arsenm added inline comments.Sep 1 2017, 10:31 AM
lib/Target/AMDGPU/AMDGPUInline.cpp
30–32 ↗(On Diff #112498)

If that is really a problem, we should find a different solution. I really don't want the set of flags used for debugging this to change.

rampitec added inline comments.Sep 1 2017, 10:59 AM
lib/Target/AMDGPU/AMDGPUInline.cpp
30–32 ↗(On Diff #112498)

OK, when we hit it next time, I guess we could expose something from TTI.

rampitec updated this revision to Diff 113563.Sep 1 2017, 11:31 AM
rampitec marked 4 inline comments as done.

Removed custom inline hint threshold.

arsenm added inline comments.Sep 11 2017, 10:33 AM
lib/Target/AMDGPU/AMDGPUInline.cpp
180–182 ↗(On Diff #112498)

Compilation time from what? I doubt it was the inliner itself. That it requires this custom wrapper-function checking sounds like additional motivation to drop it.

176–177 ↗(On Diff #113563)

By doing this you could possibly be allowing inlining with incompatible function attributes. llvm::getInlineCost has the additional functionsHaveCompatibleAttributes check, which covers more than the areInlineCompatible check above.

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
165–166 ↗(On Diff #113563)

How did you decide 9 for this?

rampitec added inline comments.Sep 11 2017, 11:14 AM
lib/Target/AMDGPU/AMDGPUInline.cpp
180–182 ↗(On Diff #112498)

The actual test case which required that was VRay, AFAIR. When we increase the inline threshold (or inline everything, like now) we are vulnerable to extremely high compilation times for huge codebases. In the case of VRay it was hours, and this decreased it to minutes.

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
165–166 ↗(On Diff #113563)

This is the rounded ratio of the standard inline threshold to the one tuned for HSAIL. We will need to retune it later.

rampitec added inline comments.Sep 11 2017, 11:32 AM
lib/Target/AMDGPU/AMDGPUInline.cpp
176–177 ↗(On Diff #113563)

areInlineCompatible() is called earlier, at line 167. If the functions are not compatible, we will not get to this point.

There should be an explanation of what this pass does and why it is better than LLVM's default inliner, and some benchmark data showing which applications / games this helps.

rampitec updated this revision to Diff 114648.Sep 11 2017, 11:52 AM

Added a file brief and more comments for the thresholds.

There should be an explanation of what this pass does and why it is better than LLVM's default inliner, and some benchmark data showing which applications / games this helps.

Explanation added. As for benchmarks, the one I used was Luxmark, where it gave a ~12% better result compared to the standard inliner. For the others, it does not look like it makes a lot of sense to try to recollect old HSAIL numbers.
This pass is ported because we know in advance there is an issue like this. As said before, the cost model is different, and some tuning will also be required later when we turn call support on.

arsenm added inline comments.Sep 12 2017, 8:15 PM
lib/Target/AMDGPU/AMDGPUInline.cpp
180–182 ↗(On Diff #112498)

I meant: where was the compile time spent? I doubt it was the inliner itself.

rampitec added inline comments.Sep 12 2017, 9:40 PM
lib/Target/AMDGPU/AMDGPUInline.cpp
180–182 ↗(On Diff #112498)

As usual: optimizing the inflated code, scheduling, and RA.

arsenm added inline comments.Sep 15 2017, 10:34 AM
lib/Target/AMDGPU/AMDGPUInline.cpp
180–182 ↗(On Diff #112498)

I'd rather drop this workaround for now at least.

rampitec updated this revision to Diff 115435.Sep 15 2017, 11:23 AM
rampitec marked 7 inline comments as done.

Removed MaxBB limit.

lib/Target/AMDGPU/AMDGPUInline.cpp
176–177 ↗(On Diff #112498)

AFAIR I've seen cases where our library call wrappers were not inlined even in relatively small programs.

180–182 ↗(On Diff #112498)

OK

arsenm accepted this revision.Sep 19 2017, 6:21 PM

LGTM. It does look like the default inline heuristic considers the SROA-ability of the passed arguments to some degree.

This revision is now accepted and ready to land.Sep 19 2017, 6:21 PM
rampitec updated this revision to Diff 115959.Sep 19 2017, 9:23 PM

Rebase to master.

This revision was automatically updated to reflect the committed changes.