This is an archive of the discontinued LLVM Phabricator instance.

[SROA] Limit the number of allowed slices when trying to split allocas
ClosedPublic

Authored by 0xdc03 on Sep 1 2023, 5:55 AM.

Details

Summary

This patch adds a hidden CLI option "--sroa-max-alloca-slices", an integer
that controls the maximum number of alloca slices SROA will consider
before bailing out. This is useful because it may not be profitable to
split a memcpy into (possibly tens of) thousands of loads and stores. It
also prevents an exponential compile-time explosion in passes like DSE and
MemCpyOpt caused by excessive alloca splitting.

Fixes https://github.com/rust-lang/rust/issues/88580.
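Assuming the option lands under the name given above, an invocation would look something like the following (the pipeline and input file name are illustrative, not part of the patch):

```
# Hypothetical usage: cap SROA at 1024 slices per alloca while running
# the default O3 pipeline on some input module.
opt -passes='default<O3>' --sroa-max-alloca-slices=1024 input.ll -S -o output.ll
```

Being a hidden cl::opt, the flag would not appear in the default `opt --help` listing, but could still be set explicitly as shown.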

Diff Detail

Event Timeline

0xdc03 created this revision.Sep 1 2023, 5:55 AM
0xdc03 requested review of this revision.Sep 1 2023, 5:55 AM
Herald added a project: Restricted Project.Sep 1 2023, 5:55 AM
nikic added inline comments.Sep 1 2023, 6:15 AM
llvm/lib/Transforms/Scalar/SROA.cpp
127

This limit is probably way too low. I'd expect something on the order of 1024 here, otherwise there will be regressions.

0xdc03 added inline comments.Sep 1 2023, 8:17 AM
llvm/lib/Transforms/Scalar/SROA.cpp
127

This limit is probably way too low. I'd expect something on the order of 1024 here, otherwise there will be regressions.

The reason I put 32 is that I thought 1024 was quite slow... here are some statistics for different values (running opt -O3 on unoptimized IR produced by rustc):

2    : 0.05s user
4    : 0.04s user
8    : 0.06s user
16   : 0.06s user
32   : 0.08s user
64   : 0.05s user
128  : 0.12s user
256  : 0.56s user
512  : 0.59s user
1024 : 6.62s user
2048 : 53.27s user

I tried to run opt without the patch, but I had to give up after 20 minutes. The exponential growth here is quite crazy, but it seems manageable at 512.
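The blow-up in these numbers can be quantified by taking the ratio between consecutive rows; a small Python sketch over the timings quoted above (the data is copied from the table, nothing else is assumed):

```python
# Slice-limit -> `opt -O3` user time (seconds), copied from the table above.
timings = {
    2: 0.05, 4: 0.04, 8: 0.06, 16: 0.06, 32: 0.08,
    64: 0.05, 128: 0.12, 256: 0.56, 512: 0.59,
    1024: 6.62, 2048: 53.27,
}

limits = sorted(timings)
for prev, cur in zip(limits, limits[1:]):
    # Ratio of runtimes each time the slice limit doubles; a constant-factor
    # ratio above 1 indicates super-linear (here roughly exponential) growth.
    ratio = timings[cur] / timings[prev]
    print(f"{prev:>4} -> {cur:>4}: {ratio:5.2f}x")
```

The last two doublings cost roughly 11x and 8x respectively, which is why 512 still looks manageable while 2048 does not.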

nikic added a comment.Sep 5 2023, 3:48 AM

As you are adding the limit as an option, you could also include the motivating test case (but specify a lower limit to not blow up the test output).

llvm/lib/Transforms/Scalar/SROA.cpp
127

I think it's okay if it doesn't fully address the compile-time issue, just get it down to a reasonable range. We can't make this limit too low (at least without a more sophisticated heuristic), because things like unrolled loops might legitimately want to SROA into a fairly large number of values.

The primary justification for this change should IMHO be that it is not really profitable to split up a single memcpy into tens or hundreds of thousands of load+store operations. Avoiding compile-time issues is a nice extra benefit -- but probably also something we can mitigate by additional changes to the passes that actually end up being slow.

0xdc03 updated this revision to Diff 556282.Sep 8 2023, 10:15 AM
  • Address reviewer comments
    • Add a test case
    • Increase default value from 32 to 1024
    • Change patch motivation
0xdc03 edited the summary of this revision. (Show Details)Sep 8 2023, 10:16 AM
nikic accepted this revision.Sep 8 2023, 7:39 PM

LGTM

This revision is now accepted and ready to land.Sep 8 2023, 7:39 PM
alex-t added a subscriber: alex-t.EditedNov 7 2023, 7:41 AM

This change breaks AMDGPU backend performance. It causes an 85% performance drop in certain rocBLAS benchmarks.

0xdc03 added a comment.Nov 7 2023, 8:10 AM

This change breaks AMDGPU backend performance. It causes an 85% performance drop in certain rocBLAS benchmarks.
The approach itself seems weird to me. You'd better add a target hook to let each target decide on the threshold instead of deciding for others.

This is also being discussed over at https://github.com/llvm/llvm-project/issues/69785, see specifically @nikic's comment. If it's causing issues for too many users, then it may be a good idea to revert for now.