This patch adds a hidden CLI option, "--sroa-max-alloca-slices", an integer that caps the number of alloca slices SROA will consider before bailing out. This is useful because splitting a memcpy into (possibly tens of) thousands of loads and stores may not be profitable. It also avoids exponential compile-time blowup in passes like DSE and MemCpyOpt caused by excessive alloca splitting.
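For reference, a hidden integer knob like this is normally declared through LLVM's cl::opt machinery. The sketch below is illustrative only: the option string matches the patch, but the variable name, description text, helper function, and the 1024 default (the value settled on later in this review) are assumptions rather than the exact patch contents.

```cpp
#include "llvm/Support/CommandLine.h"

using namespace llvm;

// Hidden option: not listed by --help unless --help-hidden is passed.
// Name matches the patch; default and description are assumptions.
static cl::opt<unsigned> SROAMaxAllocaSlices(
    "sroa-max-alloca-slices", cl::init(1024), cl::Hidden,
    cl::desc("Maximum number of alloca slices SROA will consider "
             "before bailing out"));

// Hypothetical check at the point where SROA has counted the slices of
// one alloca; in the real pass the count would come from its
// AllocaSlices analysis.
static bool shouldSplitAlloca(unsigned NumSlices) {
  return NumSlices <= SROAMaxAllocaSlices;
}
```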
Event Timeline
llvm/lib/Transforms/Scalar/SROA.cpp:127
This limit is probably way too low. I'd expect something on the order of 1024 here, otherwise there will be regressions.
llvm/lib/Transforms/Scalar/SROA.cpp:127
The reason I put 32 was because I thought 1024 was quite slow... here are some statistics for different values (running opt -O3 on unoptimized IR produced by rustc):

| Limit | opt time (user) |
|---|---|
| 2 | 0.05s |
| 4 | 0.04s |
| 8 | 0.06s |
| 16 | 0.06s |
| 32 | 0.08s |
| 64 | 0.05s |
| 128 | 0.12s |
| 256 | 0.56s |
| 512 | 0.59s |
| 1024 | 6.62s |
| 2048 | 53.27s |

I tried to run opt without the patch, but I had to give up after 20 minutes. The exponential growth here is quite crazy, but it seems manageable at 512.
As you are adding the limit as an option, you could also include the motivating test case (but specify a lower limit to not blow up the test output).
llvm/lib/Transforms/Scalar/SROA.cpp:127
I think it's okay if it doesn't fully address the compile-time issue, just get it down to a reasonable range. We can't make this limit too low (at least without a more sophisticated heuristic), because things like unrolled loops might legitimately want to SROA into a fairly large number of values.

The primary justification for this change should IMHO be that it is not really profitable to split up a single memcpy into tens or hundreds of thousands of load+store operations. Avoiding compile-time issues is a nice extra benefit -- but probably also something we can mitigate by additional changes to the passes that actually end up being slow.
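To make the unrolled-loop point concrete, here is a hypothetical example (not taken from the patch or its test case): once the loop below is fully unrolled, every Buf[I] access becomes a distinct scalar use, so SROA legitimately wants one slice per element. A cap of 32 would block this legitimate case, while something on the order of 1024 leaves room for it.

```cpp
volatile int Sink;

// Hypothetical pattern: full unrolling turns the loop into 256 distinct
// scalar accesses to Buf, so SROA may want up to 256 slices for this one
// alloca -- legitimate here, pathological at tens of thousands.
void example() {
  int Buf[256];
#pragma clang loop unroll(full)
  for (int I = 0; I != 256; ++I) {
    Buf[I] = I * I;
    Sink = Buf[I];
  }
}
```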
- Address reviewer comments
- Add a test case
- Increase default value from 32 to 1024
- Change patch motivation
This change regresses AMDGPU backend performance: it causes an 85% performance drop in certain rocBLAS benchmarks.
This is also being discussed over at https://github.com/llvm/llvm-project/issues/69785, see specifically @nikic's comment. If it's causing issues for too many users, then it may be a good idea to revert for now.