This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Modify adjustInliningThreshold to also consider the cost of passing function arguments through the stack
ClosedPublic

Authored by JanekvO on Dec 16 2022, 12:10 PM.

Details

Summary

A regression from when new PM got enabled as default. Functions with a big number of instructions will elide getting inlined but do not consider the cost of passing arguments over stack if there are a lot of function arguments. This patch attempts to add a heuristic for AMDGPU's function calling convention that also considers function arguments passed through the stack.

Diff Detail

Event Timeline

JanekvO created this revision.Dec 16 2022, 12:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 16 2022, 12:10 PM
JanekvO requested review of this revision.Dec 16 2022, 12:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 16 2022, 12:10 PM
JanekvO added a reviewer: Restricted Project.Dec 16 2022, 12:11 PM
JanekvO added inline comments.Dec 16 2022, 12:14 PM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1187

I added function passing through SGPRs as it was described for non-kernel functions in the AMDGPU backend user guide but the calling convention tablegen file for AMDGPU seems to only use VGPRs for passing arguments. Not sure if the aim is to add function argument passing through SGPRs and I should keep it in or whether to remove this.

arsenm requested changes to this revision.Dec 16 2022, 12:17 PM
arsenm added inline comments.
llvm/lib/Analysis/InlineCost.cpp
164–166 ↗(On Diff #483632)

Not sure why you're adding this but it doesn't belong in this patch

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
88–90

Should be able to compute this directly from the existing costs for stack stores

1173–1175

No reason to specially consider them?

1182

Raw argument counts don't correspond to register counts, need to get the type legalized register size

1249–1252

Typo tInlinig

llvm/test/Transforms/Inline/AMDGPU/amdgpu-inline-stack-argument.ll
19–22

Should use pass arguments or flags to set the thresholds to avoid having so many instructions in the test

2028

Don't use anonymous values in tests

This revision now requires changes to proceed.Dec 16 2022, 12:17 PM
scchan requested changes to this revision.Dec 16 2022, 1:23 PM
scchan added a subscriber: scchan.
scchan added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1188

I guess it's subtracting the number of clobbered registers - instead of a hardcoded value, could that be replaced by something more meaningful like a const variable or a getter?

Also shouldn't VGPRs have a higher penalty relative to SGPRs since they'd occupy more stack space?

arsenm added inline comments.Dec 16 2022, 1:25 PM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1188

We only sort of handle SGPR arguments today, and not for compute. We also do not currently implement the optimization of packing SGPRs into a VGPR for the argument spill

scchan added inline comments.Dec 16 2022, 1:38 PM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1188

I wasn't paying attention to the comments for ArgStackInlinePenalty. The cost model is only based on the number of instructions and it doesn't take storage into account.

JanekvO updated this revision to Diff 488665.Jan 12 2023, 8:27 AM
JanekvO marked 9 inline comments as done.
  • Address comments
  • Rebase
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
88–90

Sorry, I couldn't find any constant or cl option for stack store costs. Did you have anything in mind to replace this cl option?

1173–1175

Removed as inlining will never call kernel functions as callee.

1188

I couldn't infer what measurement unit the inliner cost/threshold uses so I took a cost model relative from the cost of a single instruction. Do let me know if the storage cost should be considered (and possibly with what amount).

arsenm added inline comments.Jan 12 2023, 8:44 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
88–90

I mean TargetTransformInfo::getMemoryOpCost (looking now and I don't think we ever implemented this. I know I tried to implement at one point but I guess never pushed it. We should implement this too)

1182

Get getPrimitiveSizeInBits doesn't work for pointers. You're also going to be repeating a lot of legalization logic. You're better off using getRegisterTypeForCallingConv and getNumRegistersForCallingConv

scchan added inline comments.Jan 12 2023, 11:54 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1188

I was thinking about the contribution of a stack's size to the overall size of the scratch since that may add penalty to the launch overhead. A VGPR store would take more stack space than a SGPR store and therefore has a higher cost (relatively speaking)? I don't know how to model it at the moment but just suggesting that would be something to consider.

JanekvO updated this revision to Diff 489588.Jan 16 2023, 9:37 AM
JanekvO marked 2 inline comments as done.
  • Address feedback, add tests for ptr and different arg types
arsenm added inline comments.Jan 25 2023, 2:26 PM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1183–1184

Can you just go through EVT? This is going to not work for vectors of pointers. You shouldn't have to consider any of these details

1212

assign DL to variable above?

1216–1218

This isn't actually true today, we don't implement SGPR argument packing

1224

Don't refer to this as a spill, it's a stack cost

JanekvO updated this revision to Diff 492929.Jan 27 2023, 3:40 PM
JanekvO marked 3 inline comments as done.
  • Review comments
JanekvO marked an inline comment as done.Jan 30 2023, 4:55 AM
JanekvO added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1183–1184

This is going to not work for vectors of pointers.

Sorry, is there a way to get vector of pointers working with EVT that I'm missing?

arsenm added inline comments.Jan 30 2023, 5:57 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1183–1184

You should be performing no logic on the type. You should pass the raw type to getEVT (i.e. remove the isPointerTy check, that doesn't cover vectors of pointers)

JanekvO added inline comments.Jan 30 2023, 7:35 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1183–1184

Ah, the reason I put some type logic here is because getNumRegistersForCallingConv calls EVT::getSizeInBits which is target dependent and will hit an unreachable error (EVT::getEVT also doesn't allow vector of pointers).

arsenm added inline comments.Jan 30 2023, 7:46 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1183–1184

I forgot this API was bad. You probably want getValueType (which is what TTI::getRegUsageForType uses, and also won't work correctly for aggregates). The most correct thing would be to call ComputeValueVTs to cover all aggregates correctly, but that feels a bit heavy. I'd go with getTLI()->getValueType(), and add a test with an array and a struct argument just to make sure those don't break too badly

arsenm added inline comments.Jan 30 2023, 7:47 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1183–1184

Actually, we kind of care a lot about getting aggregates correct. We generate a number of them for the basic HIP ABI and should be emitting more of them.

JanekvO updated this revision to Diff 493907.Feb 1 2023, 4:36 AM
  • Use ComputeValueVTs for argument EVT retrieval.
  • Rebase
arsenm accepted this revision.Feb 1 2023, 6:28 AM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1205

Hardcoding 4 for the alignment would be fine (not that this argument will do anything)

llvm/test/Transforms/Inline/AMDGPU/amdgpu-inline-stack-ptr-argument.ll
2

REQUIRES should be the first line

JanekvO updated this revision to Diff 494267.Feb 2 2023, 5:24 AM
JanekvO marked 2 inline comments as done.
  • Hardcore alignment, switch REQUIRES statement in tests
arsenm accepted this revision.Feb 2 2023, 5:30 AM
JanekvO updated this revision to Diff 494289.Feb 2 2023, 6:39 AM
  • Format fix

Windows failure seems to be unrelated.
@scchan I'm not sure if I've addressed your comments sufficiently, let me know if not.

arsenm accepted this revision.Feb 3 2023, 5:48 AM

You don't need to re-post the review for every minor formatting fix

The SGPR argument thing doesn't really matter now, it's not an optimization we perform for arguments

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1206

Align(4), not MaybeAlign

This revision was not accepted when it landed; it landed in state Needs Review.Feb 3 2023, 7:31 AM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.

Hi Janek, seeing buildbot failures: https://lab.llvm.org/buildbot/#/builders/193/builds/26047

0. Program arguments: /home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --linker-path=/home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/ld.lld -- -pie -z relro --hash-style=gnu --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o /home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/libomptarget/test/amdgcn-amd-amdhsa/offloading/Output/bug49021.cpp.tmp /lib/x86_64-linux-gnu/Scrt1.o /lib/x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/9/crtbeginS.o -L/home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/libomptarget -L/home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L/usr/lib/gcc/x86_64-linux-gnu/9 -L/usr/lib/gcc/x86_64-linux-gnu/9/../../../../lib64 -L/lib/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib64 -L/lib -L/usr/lib -rpath /home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/libomptarget -rpath /home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -rpath /home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib /tmp/lit-tmp-g5sdw5re/bug49021-57cfcf.o -lstdc++ -lm -lomp -lomptarget -lomptarget.devicertl -L/home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/lib -lgcc_s -lgcc -lpthread -lc -lgcc_s -lgcc /usr/lib/gcc/x86_64-linux-gnu/9/crtendS.o /lib/x86_64-linux-gnu/crtn.o
#0 0x000055aaa775a69f llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/clang-linker-wrapper+0x150a69f)
#1 0x000055aaa7758324 SignalHandler(int) Signals.cpp:0:0
#2 0x00007f4ad77ff420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
#3 0x000055aaa661d96a llvm::GCNTTIImpl::adjustInliningThreshold(llvm::CallBase const*) const (.cold) AMDGPUTargetTransformInfo.cpp:0:0
#4 0x000055aaa73dcf01 (anonymous namespace)::InlineCostCallAnalyzer::updateThreshold(llvm::CallBase&, llvm::Function&) InlineCost.cpp:0:0
#5 0x000055aaa73de0ab (anonymous namespace)::InlineCostCallAnalyzer::onAnalysisStart() InlineCost.cpp:0:0