This is an archive of the discontinued LLVM Phabricator instance.

[InlineCost][TargetTransformInfo][AMDGPU] Consider cost of alloca instructions in the caller (2/2)
ClosedPublic

Authored by jmmartinez on May 3 2023, 5:23 AM.

Details

Summary

Before this patch, the compiler gave a bump to the inline-threshold
when the total size of the allocas passed as arguments to the
callee was below 256 bytes.
This heuristic ignores that some of these allocas could have be removed
by SROA if inlining was applied.

Ideally, this bonus would be attributed to the threshold once the
size of all the allocas that could not be handled by SROA is known:
at the end of the InlineCost analysis.
However, we may never reach this point if the inline-cost analysis exits
early when the inline cost goes over the threshold mid-analysis.

This patch proposes:

  • Attribute the bonus in the inline-threshold when allocas are passed as arguments (regardless of their total size).
  • Assigns a cost to each alloca proportional to its size, such that the cost of all the allocas cancels the bonus.

Potential problems:

  • This patch assumes that removing alloca instructions with SROA is always profitable. This may not be the case if the total size of the allocas is still too big to be promoted to registers/LDS.
  • Redundant calls to getTotalAllocaSize
  • Awkwardly, the threshold attributed contributes to the single-bb and vector bonus.

Diff Detail

Event Timeline

jmmartinez created this revision.May 3 2023, 5:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2023, 5:23 AM
jmmartinez requested review of this revision.May 3 2023, 5:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2023, 5:23 AM

I started this 3-patch series to quick the discussion about how we could take into account the interaction between SROA and the Inliner in AMDGPU.

Currently, we may avoid inlining functions where inlining might be profitable due to SROA being applied. This patch tries to take that into account.

Pierre-vh added inline comments.May 10 2023, 11:56 PM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1221–1222

The way I understand this function is that it does a sum of the size of all the allocas used to pass arguments to a function, but only takes allocas that are in flat/private into account
If that's correct, I would rename the function to something more like getCallArgsTotalAllocaSize to reflect it.

Also nit: we seem to generally use unsigned for size types in LLVM, I also prefer size_t but I would just stay consistent with the unsigned below and also use unsigned here

1222–1226

Comment probably needs updating - this function now just calculate the size of all allocas, it doesn't adjust the inlining threshold

1252–1254
1267–1268

This is called during inlining cost calculations right? So I'd rephrase it as "that may be SROA'd" - it currently reads as this is being used by SROA

1274

Is it an issue? Can you elaborate a bit more?

llvm/test/Transforms/Inline/AMDGPU/amdgpu-inline-alloca-argument-cost.ll
97

newline at end of file :)
(I also had this a lot until I found a setting in my IDE to automatically insert it on save, if you use VSCode it's easy)

jmmartinez marked 4 inline comments as done.
  • Remarks taken into account
jmmartinez added inline comments.May 12 2023, 2:36 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1274

After adding the inline threshold bonus that comes from adjustInliningThreshold, the threshold gets multiplied by threshold-multiplier, the single-bb bonus, and the vector bonus.

The cost assigned to each alloca assumes that

Cost_Alloca_0 + ... + Cost_Alloca_N == ( ArgAllocaCost * threshold-multiplier )

But it doesn't take into acount the single-bb bonus and the vector bonus.

This may give an inlining advantage to functions with a single-bb or with vector instructions that was not there before.

The single-bb could be easily fixed tough.

I feel that this patch tries to fit the problem to the solution rather than the opposite :S

llvm/test/Transforms/Inline/AMDGPU/amdgpu-inline-alloca-argument-cost.ll
8

Fix comment.

97

Thanks for the tip!

scchan added inline comments.May 18 2023, 7:05 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1226–1228

Should we change this to unsigned too?

1264

It's not only private arrays but it also includes private objects which have its address taken and passed to the callee as argument?

scchan added inline comments.May 18 2023, 8:47 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1280

We should add your reasoning above to the comment. I think changing this to (ArgAllocaCost * getInliningThresholdMultiplier()) * (ArgAllocaSize/AllocaSize) would make it more apparent that you are getting a fraction of that bonus.

jmmartinez marked 2 inline comments as done.
  • Taking into account remarks
  • Taking into account single-bb and vector bonuses
jmmartinez added inline comments.May 19 2023, 2:46 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1264

Yes, it applies to private objects in general. I've updated the comments to match.

1280

In that case I would have to cast the division to floating point (in my opinion, not a problem) since ArgAllocaSize / AllocaSize is always 0 on integers.

Would that be ok?

scchan accepted this revision.Jun 8 2023, 12:23 PM

LGTM

This revision is now accepted and ready to land.Jun 8 2023, 12:23 PM