This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Remove CC exception for Promote Alloca Limits
ClosedPublic

Authored by Pierre-vh on Mar 8 2023, 7:31 AM.

Details

Summary

Apparently it was used to work around an issue that has since been fixed.
Removing it helps with the high scratch usage observed in some cases due to failed alloca promotion.

Diff Detail

Event Timeline

Pierre-vh created this revision. · Mar 8 2023, 7:31 AM
Herald added a project: Restricted Project. · Mar 8 2023, 7:31 AM
Pierre-vh requested review of this revision. · Mar 8 2023, 7:31 AM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
421

Leftover debug print

llvm/test/CodeGen/AMDGPU/vector-alloca-limits.ll
86

I wouldn't expect these cases to be deleted

Pierre-vh updated this revision to Diff 503368. · Mar 8 2023, 7:42 AM
Pierre-vh marked 2 inline comments as done.

Comments

llvm/test/CodeGen/AMDGPU/vector-alloca-limits.ll
86

I added some i128 test cases back in. Interestingly, 9xi128 is converted but 17xi64 is not, for some reason. Do you know why?

Pierre-vh added inline comments. · Mar 8 2023, 7:43 AM
llvm/test/CodeGen/AMDGPU/vector-alloca-limits.ll
86

Oh right, we don't convert arrays with more than 16 elements. That's why the i128 test case was useful; my bad, I should have looked into it more.

Pierre-vh updated this revision to Diff 503371. · Mar 8 2023, 7:46 AM

Add more i128 + i256 test cases to not hit the 16 element limit
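
For reference, the arithmetic behind the cases discussed above can be sketched as follows. The 16-element cap is taken from the thread; the helper itself is only illustrative and is not the pass's actual qualification logic.

```cpp
#include <cstdint>

// Illustrative sketch: the 16-element cap comes from the discussion above;
// this is not the pass's real qualification check.
constexpr uint64_t MaxPromotedElements = 16;

constexpr bool underElementCap(uint64_t NumElements) {
  return NumElements <= MaxPromotedElements;
}

// [9 x i128] (144 bytes) stays under the cap, while [17 x i64] (136 bytes)
// is rejected purely on element count, despite being smaller in bytes.
static_assert(underElementCap(9), "9-element arrays qualify");
static_assert(!underElementCap(17), "17-element arrays do not qualify");
```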

Increasing the limit may result in more spilling in other cases. In general, good performance testing is needed to determine whether this is beneficial.

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
182

Isn't this still true? You might not see the alloca, but you will see spills.

llvm/test/CodeGen/AMDGPU/vector-alloca-limits.ll
6

I would not drop or change tests, but rather add new ones as needed.

arsenm added inline comments. · Mar 8 2023, 12:00 PM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
182

Trading spills inside a function for CSR spills can be profitable, such as if the spilling occurs in a loop.

IIRC this was to work around some compile failures.

rampitec added inline comments. · Mar 8 2023, 12:02 PM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
182

One reason for the current limits was 'run out of registers' errors in some cases. I'd be really careful about extending the limits.

Pierre-vh updated this revision to Diff 503670. · Mar 9 2023, 12:43 AM
Pierre-vh marked an inline comment as done.

Restore older tests, restore limit to 1/4 when the CL option is used

Pierre-vh added inline comments. · Mar 9 2023, 12:45 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
182

I would be okay with leaving the limit at 1/4 (or raising it to 1/3 instead of 1/2), as long as the entry function CC exception is removed. That should be enough for the particular case I'm interested in.
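
For readers of the archive, a minimal sketch of the kind of calling-convention exception being discussed, assuming the pass clamps the register budget for non-entry calling conventions; the helper names and the clamp value are illustrative, not a quote of AMDGPUPromoteAlloca.cpp.

```cpp
#include <algorithm>

// Before the patch (sketch): non-entry calling conventions were clamped to
// a small VGPR budget so promotion would not force caller-saved spills.
unsigned vgprBudgetBefore(unsigned MaxVGPRs, bool IsEntryFunctionCC) {
  return IsEntryFunctionCC ? MaxVGPRs : std::min(MaxVGPRs, 32u);
}

// After the patch (sketch): the exception is removed, so every calling
// convention uses the same budget; the alloca size limit is then the same
// fraction (e.g. 1/4) of that budget everywhere.
unsigned vgprBudgetAfter(unsigned MaxVGPRs, bool /*IsEntryFunctionCC*/) {
  return MaxVGPRs;
}
```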

rampitec added inline comments. · Mar 9 2023, 11:39 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
182

> I would be okay with leaving the limit at 1/4 (or raising it to 1/3 instead of 1/2), as long as the entry function CC exception is removed. That should be enough for the particular case I'm interested in.

If Matt is OK with trading the alloca for spilling, let it be.

421

It does not make sense to me to do things differently if -amdgpu-promote-alloca-to-vector-limit is used. This option is not for users; it is for us to debug the pass and experiment. Changing the pass's behavior while debugging does not help.

arsenm added inline comments. · Mar 9 2023, 12:05 PM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
182

This pass ought to be smarter about the context and whether the vector would be live across calls.

For this particular case, it's apparently a pass ordering issue. It seems we run this before inlining, which seems wrong.

Pierre-vh added inline comments. · Mar 13 2023, 12:22 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
421

Would you prefer it if I left the factor at 1 so promote-alloca-to-vector-limit becomes an absolute value?
It would change the behavior from before (because it was x4) but may make the option easier to understand.

rampitec added inline comments. · Mar 13 2023, 3:06 PM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
421

It is an absolute value: the maximum size of an alloca in bytes. But actually it looks like it still does the same thing; it is just that the readability is off. OK, resolved.
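
As a sketch of the 'absolute value' reading described here: with the factor left at 1, the flag is compared directly against the alloca size in bytes. The declaration below is illustrative only; the real flag is declared in AMDGPUPromoteAlloca.cpp and its wording may differ.

```cpp
#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Illustrative re-declaration for the sake of the example; the real flag
// lives in AMDGPUPromoteAlloca.cpp.
static cl::opt<unsigned> PromoteAllocaToVectorLimit(
    "amdgpu-promote-alloca-to-vector-limit",
    cl::desc("Maximum byte size of an alloca considered for vector "
             "promotion (0 = derive the limit from the register budget)"),
    cl::init(0), cl::Hidden);

// With the scaling factor left at 1, a non-zero flag value is used as an
// absolute byte limit rather than being multiplied further.
static unsigned vectorPromotionByteLimit(unsigned BudgetDerivedBytes) {
  if (PromoteAllocaToVectorLimit != 0)
    return PromoteAllocaToVectorLimit;
  return BudgetDerivedBytes;
}
```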

Pierre-vh marked 2 inline comments as done. · Mar 17 2023, 12:48 AM

Unrelated, but PromoteAllocaToVectorLimit should really be moved to a new PM pass parameter.

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
416

Half feels pretty aggressive

417–419

The largest register class we support is <32 x i32>; do we definitely never introduce larger vectors?

Pierre-vh marked an inline comment as done. · Mar 27 2023, 12:28 AM
Pierre-vh added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
416

What's a better limit? 1/3 (but then we get odd numbers)?
Or do we leave it at 1/2 and just remove the CC limit?

417–419

I think anything not supported by the DAG will get split up into pieces that are supported, no?

arsenm added inline comments. · Mar 27 2023, 5:34 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
422

I think it would be a bit easier to understand if this were all done in terms of bytes and then checked against getTypeStoreSize.
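
A hedged sketch of the byte-based check being suggested, using DataLayout::getTypeStoreSize; the helper name and surrounding plumbing are illustrative rather than the pass's actual structure.

```cpp
#include <cstdint>

#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Illustrative helper: compare the alloca's allocated type, measured in
// bytes via getTypeStoreSize, against a byte budget.
static bool fitsByteBudget(const AllocaInst &AI, const DataLayout &DL,
                           uint64_t ByteBudget) {
  return DL.getTypeStoreSize(AI.getAllocatedType()).getFixedValue() <=
         ByteBudget;
}
```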

Pierre-vh updated this revision to Diff 508608. · Mar 27 2023, 5:47 AM

Use bytes instead of bits for the limit

Should I lower the limit back to 1/4 for all cases for this to be landable?

arsenm added inline comments. · Apr 3 2023, 6:47 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
431–436

This whole size logic doesn't make sense when viewed in total. The element count check should be considered along with the size. I also don't think using fractions of the register budget makes sense when deciding for an individual alloca. I think it would be easier to follow if we used a 32-bit element count limit of 16 for budgets < 64, and 32 for budgets > 64.

Thinking about byte sizes doesn't really make sense either. For sub-32-bit elements, don't we end up promoting them to 32-bit element vectors anyway?
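
A sketch of the element-count scheme proposed here, assuming "budget" refers to the per-thread register budget; the function name, the treatment of a budget of exactly 64, and the overall shape are illustrative.

```cpp
// Illustrative sketch of the proposal above: a fixed 32-bit element-count
// cap chosen from the register budget, instead of a byte fraction of it.
unsigned maxPromotedElements(unsigned RegisterBudget) {
  return RegisterBudget < 64 ? 16u : 32u;
}
```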

Pierre-vh added inline comments. · Apr 3 2023, 8:37 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
431–436

I'm not sure I understand. Do you mean that for 32-bit elements and up we should use (ByteLimit < 64) ? 16 : 32 as the maximum number of elements, and something different for <32-bit element arrays?

Note that the current patch, as-is, doesn't change the limit other than removing the CC exception and adding some NFC changes.
As this is blocking an issue, it'd be nice if we could land the minimum changes needed to fix it soon; then I can make another patch with bigger changes to how the limit is decided. Would that work for you? I could also remove some changes from this patch to make it strictly about removing the CC limit.

FWIW, I was planning to do a big set of changes to this pass soon™, starting with changing its structure so that it builds a worklist before doing the transformation, letting us consider the function as a whole. If that goes through, whatever limit we set here may get removed anyway.

Pierre-vh edited the summary of this revision. · Apr 12 2023, 12:57 AM
Pierre-vh planned changes to this revision. · Apr 12 2023, 3:56 AM
Pierre-vh updated this revision to Diff 512766. · Apr 12 2023, 4:05 AM

Refactor this patch so it just removes the CC limit.
I think it's the least controversial change and it's the one that needs to go in ASAP.
I'll look into updating the limits in a separate patch, maybe when I do my bigger set of PromoteAlloca changes.

Pierre-vh retitled this revision from [AMDGPU] Tweak PromoteAlloca limits to [AMDGPU] Remove CC exception for Promote Alloca Limits. · Apr 12 2023, 4:05 AM
Pierre-vh edited the summary of this revision.
This revision is now accepted and ready to land. · Apr 12 2023, 11:35 AM
This revision was automatically updated to reflect the committed changes.