Apparently this was used to work around an issue that has since been fixed.
Removing it helps with the high scratch usage observed in some cases when alloca promotion fails.
Repository: rG LLVM Github Monorepo
llvm/test/CodeGen/AMDGPU/vector-alloca-limits.ll:86
> I added some i128 test cases back in. Interestingly, 9 x i128 is converted but 17 x i64 is not, for some reason. Do you know why?
llvm/test/CodeGen/AMDGPU/vector-alloca-limits.ll:86
> Oh right, we don't convert >16-element arrays. That's why the i128 test case was useful, my bad - should have looked more into it.
Increasing the limit may result in more spilling in other cases. In general, thorough performance testing is needed to determine whether this is beneficial.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp:182
> Isn't this still true? You might not see the alloca, but you will see spills.
llvm/test/CodeGen/AMDGPU/vector-alloca-limits.ll:6
> I would not drop or change tests, but rather add new ones as needed.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp:182
> Trading spills inside a function for CSR spills can be profitable, for example if the spilling occurs in a loop. IIRC this was to work around some compile failures.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp:182
> One reason for the current limits was "ran out of registers" errors in some cases. I'd be really careful about extending the limits.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp:182
> I would be okay with leaving the limit at 1/4 (or raising it to 1/3 instead of 1/2), as long as the entry-function CC exception is removed. That should be enough for the particular case I'm interested in.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp:182
> If Matt is OK with trading alloca promotion for spilling, let it be.

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp:421
> It does not make sense to me to do things differently when -amdgpu-promote-alloca-to-vector-limit is used. This option is not for users; it is for us to debug the pass and experiment. Changing the pass's behavior while debugging does not help.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp:182
> This pass ought to be smarter about the context and whether the vector would be live across calls. For this particular case, it's apparently a pass-ordering issue: it seems we run this before inlining, which seems wrong.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp:421
> Would you prefer it if I left the factor at 1, so that -amdgpu-promote-alloca-to-vector-limit becomes an absolute value?
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp:421
> It is an absolute value: the maximum size of the alloca in bytes. Actually, it looks like it still does the same thing; only the readability is off. But OK, resolved.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp:422
> I think this would be a bit easier to understand if it were all done in terms of bytes and then checked against getTypeStoreSize.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp:431-436
> This whole size logic doesn't make sense when viewed in total. The element-count check should be considered along with the size. I also don't think using fractions of the register budget makes sense when deciding for an individual alloca. I think it would be easier to follow if we used a 32-bit element-count limit of 16 for budgets < 64 and 32 for budgets > 64. Thinking in terms of byte sizes doesn't really make sense either: for sub-32-bit elements, don't we end up promoting them to 32-bit element vectors anyway?
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp:431-436
> I'm not sure I understand: do you mean that for 32-bit elements and up we should use (ByteLimit < 64) ? 16 : 32 as the maximum number of elements, and something different for arrays with elements narrower than 32 bits? Note that the current patch, as is, doesn't change the limit other than removing the CC exception and adding some NFC changes. FWIW, I was planning a big set of changes to this pass soon™, starting with restructuring it to build a worklist before doing the transformation, so we can consider the function as a whole. If that goes through, whatever limit we set here may be removed anyway.
Refactor this patch so it just removes the CC limit.
I think it's the least controversial change, and it's the one that needs to go in ASAP.
I'll look into updating the limits in a separate patch, maybe as part of my bigger set of PromoteAlloca changes.