A regression from when new PM got enabled as default. Functions with a big number of instructions will elide getting inlined but do not consider the cost of passing arguments over stack if there are a lot of function arguments. This patch attempts to add a heuristic for AMDGPU's function calling convention that also considers function arguments passed through the stack.
Details
- Reviewers
arsenm scchan - Group Reviewers
Restricted Project - Commits
- rG142c28ffa132: [AMDGPU] Modify adjustInliningThreshold to also consider the cost of passing…
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
1204 | I added function passing through SGPRs as it was described for non-kernel functions in the AMDGPU backend user guide but the calling convention tablegen file for AMDGPU seems to only use VGPRs for passing arguments. Not sure if the aim is to add function argument passing through SGPRs and I should keep it in or whether to remove this. |
llvm/lib/Analysis/InlineCost.cpp | ||
---|---|---|
164–166 ↗ | (On Diff #483632) | Not sure why you're adding this but it doesn't belong in this patch |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
86–88 | Should be able to compute this directly from the existing costs for stack stores | |
1190–1192 | No reason to specially consider them? | |
1199 | Raw argument counts don't correspond to register counts, need to get the type legalized register size | |
1242–1245 | Typo tInlinig | |
llvm/test/Transforms/Inline/AMDGPU/amdgpu-inline-stack-argument.ll | ||
19–22 | Should use pass arguments or flags to set the thresholds to avoid having so many instructions in the test | |
2028 | Don't use anonymous values in tests |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
1205 | I guess it's subtracting the number of clobbered registers - instead of a hardcoded value, could that be replaced by something more meaningful like a const variable or a getter? Also shouldn't VGPRs have a higher penalty relative to SGPRs since they'd occupy more stack space? |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
1205 | We only sort of handle SGPR arguments today, and not for compute. We also do not currently implement the optimization of packing SGPRs into a VGPR for the argument spill |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
1205 | I wasn't paying attention to the comments for ArgStackInlinePenalty. The cost model is only based on the number of instructions and it doesn't take storage into account. |
- Address comments
- Rebase
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
86–88 | Sorry, I couldn't find any constant or cl option for stack store costs. Did you have anything in mind to replace this cl option? | |
1190–1192 | Removed as inlining will never call kernel functions as callee. | |
1205 | I couldn't infer what measurement unit the inliner cost/threshold uses so I took a cost model relative from the cost of a single instruction. Do let me know if the storage cost should be considered (and possibly with what amount). |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
86–88 | I mean TargetTransformInfo::getMemoryOpCost (looking now and I don't think we ever implemented this. I know I tried to implement at one point but I guess never pushed it. We should implement this too) | |
1199 | Get getPrimitiveSizeInBits doesn't work for pointers. You're also going to be repeating a lot of legalization logic. You're better off using getRegisterTypeForCallingConv and getNumRegistersForCallingConv |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
1205 | I was thinking about the contribution of a stack's size to the overall size of the scratch since that may add penalty to the launch overhead. A VGPR store would take more stack space than a SGPR store and therefore has a higher cost (relatively speaking)? I don't know how to model it at the moment but just suggesting that would be something to consider. |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
1200–1201 | Can you just go through EVT? This is going to not work for vectors of pointers. You shouldn't have to consider any of these details | |
1229 | assign DL to variable above? | |
1233–1235 | This isn't actually true today, we don't implement SGPR argument packing | |
1241 | Don't refer to this as a spill, it's a stack cost |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
1200–1201 |
Sorry, is there a way to get vector of pointers working with EVT that I'm missing? |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
1200–1201 | You should be performing no logic on the type. You should pass the raw type to getEVT (i.e. remove the isPointerTy check, that doesn't cover vectors of pointers) |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
1200–1201 | Ah, the reason I put some type logic here is because getNumRegistersForCallingConv calls EVT::getSizeInBits which is target dependent and will hit an unreachable error (EVT::getEVT also doesn't allow vector of pointers). |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
1200–1201 | I forgot this API was bad. You probably want getValueType (which is what TTI::getRegUsageForType uses, and also won't work correctly for aggregates). The most correct thing would be to call ComputeValueVTs to cover all aggregates correctly, but that feels a bit heavy. I'd go with getTLI()->getValueType(), and add a test with an array and a struct argument just to make sure those don't break too badly |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
1200–1201 | Actually, we kind of care a lot about getting aggregates correct. We generate a number of them for the basic HIP ABI and should be emitting more of them. |
Windows failure seems to be unrelated.
@scchan I'm not sure if I've addressed your comments sufficiently, let me know if not.
You don't need to re-post the review for every minor formatting fix
The SGPR argument thing doesn't really matter now, it's not an optimization we perform for arguments
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
1223 | Align(4), not MaybeAlign |
Hi Janek, seeing buildbot failures: https://lab.llvm.org/buildbot/#/builders/193/builds/26047
0. Program arguments: /home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/clang-linker-wrapper --host-triple=x86_64-unknown-linux-gnu --linker-path=/home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/ld.lld -- -pie -z relro --hash-style=gnu --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o /home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/libomptarget/test/amdgcn-amd-amdhsa/offloading/Output/bug49021.cpp.tmp /lib/x86_64-linux-gnu/Scrt1.o /lib/x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/9/crtbeginS.o -L/home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/libomptarget -L/home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L/usr/lib/gcc/x86_64-linux-gnu/9 -L/usr/lib/gcc/x86_64-linux-gnu/9/../../../../lib64 -L/lib/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib64 -L/lib -L/usr/lib -rpath /home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/libomptarget -rpath /home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -rpath /home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib /tmp/lit-tmp-g5sdw5re/bug49021-57cfcf.o -lstdc++ -lm -lomp -lomptarget -lomptarget.devicertl -L/home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/lib -lgcc_s -lgcc -lpthread -lc -lgcc_s -lgcc /usr/lib/gcc/x86_64-linux-gnu/9/crtendS.o /lib/x86_64-linux-gnu/crtn.o
#0 0x000055aaa775a69f llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/omp-vega20-0/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/clang-linker-wrapper+0x150a69f)
#1 0x000055aaa7758324 SignalHandler(int) Signals.cpp:0:0
#2 0x00007f4ad77ff420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
#3 0x000055aaa661d96a llvm::GCNTTIImpl::adjustInliningThreshold(llvm::CallBase const*) const (.cold) AMDGPUTargetTransformInfo.cpp:0:0
#4 0x000055aaa73dcf01 (anonymous namespace)::InlineCostCallAnalyzer::updateThreshold(llvm::CallBase&, llvm::Function&) InlineCost.cpp:0:0
#5 0x000055aaa73de0ab (anonymous namespace)::InlineCostCallAnalyzer::onAnalysisStart() InlineCost.cpp:0:0
Should be able to compute this directly from the existing costs for stack stores